Jsoup response differs from what the browser inspector shows

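I want to parse a web page with jsoup, but the HTML it returns differs from what the browser inspector shows. When I open the page in a browser, I can see an <ol> tag under <section id='js_item_list_section'>......</section>. But when I fetch the page with jsoup from a Spring Boot project, I cannot see the <ol> tag under that section. There is also a <div key=""> under the section. The returned response is as follows: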

Jsoup response:

<section id="js_item_list_section" class="item-list item-list--loading clearfix">
 <div key="itemlist-loader" class="ellipsis-loader-wrapper ellipsis-loader-wrapper--text ellipsis-loader-wrapper--top">
  <div class="ellipsis-loader ellipsis-loader--branded center-x">
   <div class="ellipsis-loader__dot"></div>
   <div class="ellipsis-loader__dot"></div>
   <div class="ellipsis-loader__dot"></div>
  </div>
  <span class="loader-text center-x">Y&uuml;kleniyor</span>
 </div>
</section>

Web browser (Chrome) inspector:

<section id="js_item_list_section" class="item-list clearfix">
  <ol>
     <li>.....</li>
     <li>.....</li>
  </ol>
</section>

I think the page is probably built with React.js.

And here is my code:

Document document = Jsoup.connect(myUrl)
  .ignoreContentType(true)
  .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36")
  .get();
Element itemListSection  = document.getElementById("js_item_list_section");

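The problem is most likely that the page you are trying to parse contains dynamically generated content (the id js_item_list_section already hints that JavaScript is used to render this content).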

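Jsoup does not interpret JavaScript, so it also does not load content that is fetched through AJAX calls. Unfortunately, this means Jsoup cannot be used the way you want.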

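I see two options for you:

1) Use a tool such as Selenium WebDriver, which controls a real browser from Java and therefore also lets you parse dynamically generated content. This is easy to implement, but it introduces a new dependency (an entire browser!) and runs rather slowly.

2) Analyze the AJAX calls that the JavaScript makes to load the content it renders on the page. Use the browser's developer tools to find the actual call, then make that call directly from Java and parse the data. Such data is usually transferred as JSON, so Jsoup is of limited help here. This option takes more effort, but it runs much faster and does not add more dependencies to your project.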
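Option 2 could look roughly like the sketch below. It uses only the JDK's HttpURLConnection, so it should also work on Java 8. The endpoint URL, the X-Requested-With header, and the class and method names are assumptions for illustration; the real URL, parameters, and headers have to be copied from the Network tab of the browser's developer tools.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AjaxCallSketch {

    // Prepares (but does not yet send) a GET request that asks for JSON.
    static HttpURLConnection openJsonRequest(String endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        // Assumption: some sites only answer XHR-style requests; drop this if not needed.
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        return conn;
    }

    // Sends the request and returns the raw response body as a string.
    static String fetchBody(String endpoint) throws IOException {
        HttpURLConnection conn = openJsonRequest(endpoint);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No endpoint given: the real URL must come from the browser's Network tab.
            System.out.println("usage: java AjaxCallSketch <ajax-endpoint-url>");
            return;
        }
        System.out.println(fetchBody(args[0]));
    }
}
```

The returned string would then go to a JSON parser of your choice rather than to Jsoup. If the endpoint turns out to return an HTML fragment instead of JSON, Jsoup.parse(body) can still be used on it.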

I tried the web driver like this:

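import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
// Assumption: the property key was omitted in the post. Selenium reads the driver path
// from "webdriver.chrome.driver", which must point to chromedriver.exe, not chrome.exe.
System.setProperty("webdriver.chrome.driver", MyChromeExePath);
WebDriver webDriver = new ChromeDriver();
webDriver.get(trivagoUrl.toString());
String pageSource = webDriver.getPageSource();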

After the line WebDriver webDriver = new ChromeDriver(); the browser opened. After that, it threw a timeout exception:

2018-08-24 18:52:01.116 ERROR 29316 --- [nio-8080-exec-6] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.openqa.selenium.WebDriverException: Timed out waiting for driver server to start.
Build info: version: '3.9.1', revision: '63f7b50', time: '2018-02-07T22:25:02.294Z'
System info: host: 'DESKTOP-RP0T36G', ip: '192.168.1.21', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_121'
Driver info: driver.version: ChromeDriver] with root cause

java.util.concurrent.TimeoutException: null
    at java.util.concurrent.FutureTask.get(Unknown Source) ~[na:1.8.0_121]
    at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:148) ~[guava-23.6-jre.jar:na]
    at org.openqa.selenium.net.UrlChecker.waitUntilAvailable(UrlChecker.java:75) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.service.DriverService.waitUntilAvailable(DriverService.java:187) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.service.DriverService.start(DriverService.java:178) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:79) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:601) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:219) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:142) ~[selenium-remote-driver-3.9.1.jar:na]
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:181) ~[selenium-chrome-driver-3.9.1.jar:na]
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:168) ~[selenium-chrome-driver-3.9.1.jar:na]
    at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:123) ~[selenium-chrome-driver-3.9.1.jar:na]
    at com.io.zizu.m2m.parse.command.TrivagoSearchCommand.getSearchResults(TrivagoSearchCommand.java:131) ~[main/:na]
    at com.io.zizu.m2m.parse.command.TrivagoSearchCommand$$FastClassBySpringCGLIB$$a6dcf772.invoke(<generated>) ~[main/:na]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204) ~[spring-core-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:746) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheInterceptor.lambda$invoke$0(CacheInterceptor.java:53) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheAspectSupport.invokeOperation(CacheAspectSupport.java:336) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:391) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:316) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:61) ~[spring-context-5.0.8.RELEASE.jar:5.0.8.RELEASE]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185) ~[spring-aop-5.0.8.RELEASE.jar:5.0.8.RELEASE]