JSOUP: get the HTML content from a redirected link
Consider the following URL:
http://www.google.com/url?rct=j&sa=t&url=http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings&ct=ga&cd=CAIyHWU3NmVhMGQ0NWQ3MmRmY2I6Y29tOmVuOlVTOlJM&usg=AFQjCNE_8XwECqkmyPIMzcSxCDh2hP16wQ
When I pass this URL to jsoup, the HTML content I get back is not accurate. But when I open this URL in a browser, it redirects to
http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings
When I then pass that second URL to jsoup, I get the accurate HTML content.
How can I get the accurate HTML content starting from the first URL?
I have tried many options:
    Response response = Jsoup.connect(url).followRedirects(true).timeout(timeOut * 1000).userAgent(userAgent).execute();
    int status = response.statusCode();
    if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM || status == HttpURLConnection.HTTP_SEE_OTHER) {
        redirectUrl = response.header("location");
        response = Jsoup.connect(redirectUrl).followRedirects(false).timeout(timeOut * 1000).userAgent(userAgent).execute();
    }
    Document doc = response.parse();
I have also tried many user agents, the .referrer("http://google.com") option, and so on.
I am currently using jsoup version 1.8.3.
Google returns an HTML page containing a JavaScript/META redirect:
<script>window.googleJavaScriptRedirect=1</script><script>var n={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};n.navigateTo(window.parent,window,"http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings");
</script><noscript><META http-equiv="refresh" content="0;URL='http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings'"></noscript>
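This is different from a redirect sent via HTTP headers. Since jsoup does not interpret JavaScript, you are out of luck with followRedirects.
However, you can of course parse the page to get the real link. In fact, you can already do this without contacting Google at all, because the link is part of the parameters in the original URL.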
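Both ideas can be sketched in a few lines of plain Java. This is a minimal sketch, not a definitive implementation: the class name GoogleRedirect and its two methods are hypothetical helpers made up for illustration, it assumes the target is carried in a query parameter named url (as in the URL from the question), and it uses a regular expression for the META tag only to stay dependency-free.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of jsoup.
final class GoogleRedirect {

    // Matches e.g. <META http-equiv="refresh" content="0;URL='http://...'">
    // and captures the target URL in group 1.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*content\\s*=\\s*[\"']"
                    + "\\s*\\d+\\s*;\\s*url\\s*=\\s*['\"]?([^'\">\\s]+)",
            Pattern.CASE_INSENSITIVE);

    // Reads the "url" query parameter without contacting Google.
    // Caveat: if the target is not percent-encoded and itself contains '&',
    // splitting on '&' will truncate it.
    static Optional<String> targetFromQuery(String googleUrl) {
        int q = googleUrl.indexOf('?');
        if (q < 0) {
            return Optional.empty();
        }
        for (String pair : googleUrl.substring(q + 1).split("&")) {
            if (pair.startsWith("url=")) {
                // URLDecoder.decode(String, Charset) requires Java 10 or later.
                return Optional.of(
                        URLDecoder.decode(pair.substring(4), StandardCharsets.UTF_8));
            }
        }
        return Optional.empty();
    }

    // Fallback: pull the target out of the META refresh tag in the returned page.
    static Optional<String> targetFromMetaRefresh(String html) {
        Matcher m = META_REFRESH.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String googleUrl = "http://www.google.com/url?rct=j&sa=t"
                + "&url=http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings"
                + "&ct=ga";
        String html = "<noscript><META http-equiv=\"refresh\" content=\"0;URL='"
                + "http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings"
                + "'\"></noscript>";
        System.out.println(targetFromQuery(googleUrl).orElse("(no url parameter)"));
        System.out.println(targetFromMetaRefresh(html).orElse("(no meta refresh)"));
    }
}
```

The extracted link can then be passed to Jsoup.connect(...) as usual. If you have already fetched the Google page with jsoup, selecting the META tag from the parsed Document is likely more robust than a regular expression.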