HTTP parser: scraping a single-page application with many GETs: how to find out when the page ends
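I am trying to parse this site:
https://www.monster.com/jobs/search/?q=java&where=usa&stpage=1
In essence it is not complicated: it is a single-page application. You give it keywords, click search, and it shows the results. Initially it shows only about 29 results; new results are loaded as you scroll down.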
Before new results are loaded, it sends a GET request to
https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=2&total=26
which returns a JSON response: a list of jobs, each of which looks something like this:
{"Title":"Java Developer","TitleLink":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","DatePostedText":"6 days ago","DatePosted":"2020-01-18T12:00","LocationText":"Orlando, FL, 32801","JobViewUrl":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","ImpressionTracking":"data-m_impr_uuid=\"a7320356-70db-46ca-908e-e540f0e74cec\" data-m_impr_a_placement_id=\"JSR2CW\" data-m_impr_s_t=\"t\" data-m_impr_j_p=\"27\" data-m_impr_j_jpm=\"1\" data-m_impr_j_lat=\"28.5418\" data-m_impr_j_long=\"-81.3736\" data-m_impr_j_jawsid=\"418397617\" data-m_impr_j_postingid=\"b55f4409-3858-483a-a2e9-65e254ec1cd2\" data-m_impr_j_jobid=\"215193478\" data-m_impr_j_cid=\"660\" data-m_impr_j_occid=\"11970\" data-m_impr_j_lid=\"385\" data-m_impr_j_jpt=\"1\" data-m_impr_j_pvc=\"monster\" data-m_impr_j_coc=\"xsummittechx\" ","Company":{"Name":"Summit Technologies","HasCompanyAddress":true,"LogoLink":""},"Text":"Java Developer","ApplyType":"ApplyOnline","IsAggregated":"false","JobViewUrlMeta":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","MusangKingId":"215193478","CompanyLogoUrl":"","PrivateBoardIconImageUrl":"","FitIcon":"","FitIconType":""}
Another request, a POST, is sent to
https://ib.adnxs.com/ut/v3
(the v3 request), in which the value 14162549 in tag_id: 14162549 appears to be taken from the GET request above.
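So, as you scroll down, it sends one GET and one POST request each time, until it stops sending them: the scrolling ends and so do the requests.
I don't understand how it determines when to stop.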
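I want to scrape these jobs, and I could do something like sending GET requests to
https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=N
but I don't know when to stop. If, say, scrolling stops at &page=12 and I send a request for &page=13, it does not return empty JSON; instead it returns some other jobs (perhaps less relevant ones, which would explain why they are not shown when you scroll to the bottom).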
I send the requests with OkHttp, like this:
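import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import java.util.List;
import java.util.Locale;

// getUrl() returns the pagination endpoint shown above, without the page parameter.
HttpUrl url = HttpUrl.parse(getUrl()).newBuilder()
        .addQueryParameter("page", "1")
        .build();
Request request = new Request.Builder()
        .url(url)
        // Content-Type describes a request body; a GET has none, so this header is probably unnecessary.
        .addHeader("Content-Type", "application/json; charset=utf-8")
        .addHeader("Accept-Language", Locale.US.getLanguage())
        .build();
OkHttpClient client = new OkHttpClient();
// try-with-resources closes the response body even if parsing fails.
try (Response response = client.newCall(request).execute()) {
    String responseBody = response.body().string();
    System.out.println(responseBody);
    // Each element of the JSON array is mapped to one MonsterJobJson.
    List<MonsterJobJson> resultMonster = new Gson().fromJson(
            responseBody, new TypeToken<List<MonsterJobJson>>() {}.getType());
}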
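I don't have enough reputation to comment.
You might take a look at div.mux-search-results. It appears to have several attributes that describe how more results are loaded, as well as the number of results per page and the total number of results. Some attributes that seem relevant are listed below: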
data-results-page="1"
data-results-url="https://www.monster.com/jobs/search/pagination/?q=java&where=usa&stpage=1&isDynamicPage=true&isMKPagination=true"
data-results-per-page="25"
data-results-total="250"
data-total-search-results="61503"
data-results-max="250"
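If those attributes really are what drives the scrolling, the last page to request could be derived from them. Below is a minimal sketch in plain Java. It rests on an assumption I have not verified against the site's JavaScript: that loading stops once min(data-results-total, data-results-max) results have been shown, at data-results-per-page results per page. The PageLimit class and its intAttribute and lastPage methods are hypothetical names, and the regex lookup is for illustration only; a real HTML parser would be more robust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageLimit {

    /** Reads an integer attribute such as data-results-max="250" from raw HTML. */
    static int intAttribute(String html, String name) {
        // \b stops the name from matching in the middle of a longer word.
        Matcher m = Pattern.compile("\\b" + Pattern.quote(name) + "=\"(\\d+)\"").matcher(html);
        if (!m.find()) {
            throw new IllegalArgumentException("attribute not found: " + name);
        }
        return Integer.parseInt(m.group(1));
    }

    /** Number of pages needed to show maxResults items at perPage items per page. */
    static int lastPage(int maxResults, int perPage) {
        if (perPage <= 0) {
            throw new IllegalArgumentException("perPage must be positive");
        }
        // Integer ceiling division: a partially filled final page still counts.
        return (maxResults + perPage - 1) / perPage;
    }

    public static void main(String[] args) {
        // Sample markup built from the attribute values listed above.
        String html = "<div class=\"mux-search-results\" data-results-per-page=\"25\" "
                + "data-results-total=\"250\" data-total-search-results=\"61503\" "
                + "data-results-max=\"250\"></div>";
        int perPage = intAttribute(html, "data-results-per-page");
        int total = intAttribute(html, "data-results-total");
        int max = intAttribute(html, "data-results-max");
        // Assumption: the smaller of the two caps is the one that applies.
        System.out.println(lastPage(Math.min(total, max), perPage)); // prints 10
    }
}
```

With the values above this gives 10 pages, so a scraper would request page=1 through page=10 and then stop. It is worth comparing that number with where scrolling actually ends in the browser before relying on it.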