Real-time web crawling using Jsoup

Question

I need to pull just the time (hour ending) and day-ahead hourly price columns from this page: https://rrtp.comed.com/pricing-table-today/. I tried the following code:

Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();

for (Element table : doc.select("table.prices three-col")) {
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");

        if (tds.size() > 2) {
            System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
        }
    }
}

Unfortunately, I can't get the data I need.

Is something wrong with the code, or can this page not be crawled?

Any help would be appreciated.

Answer 1

As I said in the comments:

You should hit https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717, because that is where the page you are pointing at loads its data from.
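Since the feed URL carries a fixed date, a "real-time" crawl would need to rebuild it each day. The following is only a sketch: it assumes the date parameter is in yyyyMMdd form (inferred from 20150717) and that the utility's day follows US Central time. The class name FeedUrl and the method forDate are hypothetical helpers, not part of Jsoup.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class FeedUrl {
    // Base of the feed URL quoted in the answer; only the date value changes.
    static final String BASE =
            "https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=";

    // Assumption: the date parameter is yyyyMMdd, inferred from "20150717".
    static String forDate(LocalDate date) {
        return BASE + date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        // Assumption: prices are published per calendar day in US Central time.
        LocalDate today = LocalDate.now(ZoneId.of("America/Chicago"));
        System.out.println(forDate(today));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) in place of the hard-coded URL.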

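The data behind this link is not a valid HTML document (which is why your code does not work on it), but you can easily make it valid enough.

All you have to do is get the response, wrap it in <table>..</table> tags, and parse the result as an HTML document.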

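import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the raw feed body; it is not a complete HTML document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
// Wrap the body in <table> so the HTML parser keeps the row elements.
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");

for (Element element : doc.select("tr")) {
    System.out.println(element.html());
}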
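To get only the two columns the question asks for, the wrapped fragment can be walked row by row. This is a minimal sketch, assuming the hour is in the first cell and the day-ahead price in the second (the same column order the question's code assumes). The class PriceRows, the method firstTwoColumns, and the sample rows are hypothetical and do not reproduce real feed output.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class PriceRows {
    // Returns "first cell:second cell" for every row with at least two cells.
    static List<String> firstTwoColumns(String rowsFragment) {
        // Wrap the bare rows in <table> so the parser keeps <tr> and <td>.
        Document doc = Jsoup.parse("<table>" + rowsFragment + "</table>");
        List<String> out = new ArrayList<>();
        for (Element row : doc.select("tr")) {
            Elements tds = row.select("td");
            if (tds.size() >= 2) {
                out.add(tds.get(0).text() + ":" + tds.get(1).text());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not real feed output.
        String sample = "<tr><td>1:00 AM</td><td>2.1</td><td>2.3</td></tr>"
                      + "<tr><td>2:00 AM</td><td>1.9</td><td>2.0</td></tr>";
        firstTwoColumns(sample).forEach(System.out::println);
    }
}
```

With the live feed, response.body() from the snippet above would be passed in place of the sample string.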

java

web-crawler

jsoup