Jsoup response differs from what the browser inspector shows
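I want to parse a web page with jsoup, but the HTML it returns differs from what the browser inspector shows. When I open the page in a browser, I can see an <ol> tag under <section id='js_item_list_section'>......</section>. But when I fetch the page with jsoup from a Spring Boot project, there is no <ol> tag under that section; it contains a <div key=""> instead. The response looks like this:
JSOUP response: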
<section id="js_item_list_section" class="item-list item-list--loading clearfix">
<div key="itemlist-loader" class="ellipsis-loader-wrapper ellipsis-loader-wrapper--text ellipsis-loader-wrapper--top">
<div class="ellipsis-loader ellipsis-loader--branded center-x">
<div class="ellipsis-loader__dot"></div>
<div class="ellipsis-loader__dot"></div>
<div class="ellipsis-loader__dot"></div>
</div>
<span class="loader-text center-x">Yükleniyor</span>
</div>
</section>
Web browser (Chrome) inspector:
<section id="js_item_list_section" class="item-list clearfix">
<ol>
<li>.....</li>
<li>.....</li>
</ol>
</section>
I think the page is probably built with React.js.
And my code block:
Document document = Jsoup.connect(myUrl)
.ignoreContentType(true)
.userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36")
.get();
Element itemListSection = document.getElementById("js_item_list_section");
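The problem is most likely that the page you are trying to parse contains dynamically generated content (the id js_item_list_section already hints that JavaScript is used to render it).
Jsoup does not interpret JavaScript, so it also does not load content that the page fetches through AJAX calls. Unfortunately, that means Jsoup cannot be used the way you want.
I see two options:
1) Use a tool like Selenium WebDriver, which controls a real browser from Java and therefore also lets you parse dynamically generated content. This is easy to implement, but it introduces a new dependency (a whole browser!) and runs rather slowly.
2) Analyze the AJAX calls that the JavaScript makes to load the content it renders. Use the browser's developer tools to find the actual calls, then make them directly from Java and parse the returned data. Such data is usually transferred as JSON, so Jsoup is of limited help here. This option takes more effort, but it runs much faster and adds no further dependencies to your project.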
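Option 2 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the site's real API: the endpoint URL is a hypothetical placeholder for whatever request shows up in the browser's Network tab, the X-Requested-With header may or may not be required by the site, and the class and method names (AjaxCallSketch, buildRequest, fetchBody) are made up for this example. It uses java.net.http, so it assumes Java 11 or newer.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Sketch of option 2: request the data endpoint directly instead of the rendered page. */
class AjaxCallSketch {

    // Hypothetical endpoint; replace it with the request URL found in the browser's Network tab.
    static final String EXAMPLE_ENDPOINT = "https://example.com/api/items?page=1";

    /** Builds a GET request that resembles the XHR call the browser makes. */
    static HttpRequest buildRequest(String endpoint, String userAgent) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .header("User-Agent", userAgent)
                // Some sites mark AJAX calls with this header; whether it is needed is site-specific.
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw body (usually JSON) for a JSON parser to handle. */
    static String fetchBody(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildRequest(EXAMPLE_ENDPOINT, "Mozilla/5.0");
        System.out.println(fetchBody(client, request));
    }
}
```

The returned string would then go to a JSON library of your choice rather than to Jsoup. Copying the request headers and parameters exactly as the browser sends them is usually the safest starting point, since endpoints like this are often undocumented.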
I tried the web driver like this:
System.setProperty(MyChromeExePath);
WebDriver webDriver = new ChromeDriver();
webDriver.get(trivagoUrl.toString());
String pageSource = webDriver.getPageSource();
After the line WebDriver webDriver = new ChromeDriver(); the browser opened. Then it threw a timeout exception:
2018-08-24 18:52:01.116 ERROR 29316 --- [nio-8080-exec-6] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.openqa.selenium.WebDriverException: Timed out waiting for driver server to start.
Build info: version: '3.9.1', revision: '63f7b50', time: '2018-02-07T22:25:02.294Z'
System info: host: 'DESKTOP-RP0T36G', ip: '192.168.1.21', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_121'
Driver info: driver.version: ChromeDriver] with root cause
java.util.concurrent.TimeoutException: null
at java.util.concurrent.FutureTask.get(Unknown Source) ~[na:1.8.0_121]
at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:148) ~[guava-23.6-jre.jar:na]
at org.openqa.selenium.net.UrlChecker.waitUntilAvailable(UrlChecker.java:75) ~[selenium-remote-driver-3.9.1.jar:na]
at org.openqa.selenium.remote.service.DriverService.waitUntilAvailable(DriverService.java:187) ~[selenium-remote-driver-3.9.1.jar:na]
at org.openqa.selenium.remote.service.DriverService.start(DriverService.java:178) ~[selenium-remote-driver-3.9.1.jar:na]
at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:79) ~[selenium-remote-driver-3.9.1.jar:na]
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:601) ~[selenium-remote-driver-3.9.1.jar:na]
at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:219) ~[selenium-remote-driver-3.9.1.jar:na]
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:142) ~[selenium-remote-driver-3.9.1.jar:na]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:181) ~[selenium-chrome-driver-3.9.1.jar:na]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:168) ~[selenium-chrome-driver-3.9.1.jar:na]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:123) ~[selenium-chrome-driver-3.9.1.jar:na]
at com.io.zizu.m2m.parse.command.TrivagoSearchCommand.getSearchResults(TrivagoSearchCommand.java:131) ~[main/:na]
at com.io.zizu.m2m.parse.command.TrivagoSearchCommand$$FastClassBySpringCGLIB$$a6dcf772.invoke(<generated>) ~[main/:na]
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) ~[spring-core-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:746) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.cache.interceptor.CacheInterceptor.lambda$invoke$0(CacheInterceptor.java:53) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.cache.interceptor.CacheAspectSupport.invokeOperation(CacheAspectSupport.java:336) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:391) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:316) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:61) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]