Web crawling in Java

Question

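I have a situation where I need to crawl a set of web pages that contain only some XML data, and I want to get the attributes of a specific element. How can I do this in Java?

That is, the XML structure is: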

<page>
       <student id="2406">
        .
        .
       </student>

       .
       . 
       . 
</page>

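I need to crawl a lot of pages, so please recommend a fast crawling tool.

Edit: I have looked at some pages related to this but did not find a satisfactory answer. Any code would also be appreciated.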

Answer 1

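Jsoup would be a good fit for this. Here is what you can do with it:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
String xml = "this would be your xml";
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); // parse as XML, not HTML
for (Element e : doc.select("tag")) {
    System.out.println(e); // prints each node named "tag"; e.attr("id") would return its id attribute
}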

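To fetch a web page, use the following code:

// "url" is a placeholder; the XML parser is set because get() parses as HTML by default
Document doc = Jsoup.connect("url").parser(Parser.xmlParser()).get();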
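
If adding a dependency is not an option, the attribute extraction itself can also be sketched with only the JDK. Note that this is not Jsoup: it swaps in the standard StAX streaming parser (javax.xml.stream), and it assumes the pages are well-formed XML with quoted attribute values. The class name StudentIdExtractor and the method extractAttribute are made up for illustration; fetching the page text is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Jsoup or of the original answer.
public class StudentIdExtractor {

    // Collects the value of `attribute` from every <elementName> element in the XML text.
    static List<String> extractAttribute(String xml, String elementName, String attribute)
            throws XMLStreamException {
        List<String> values = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // Only start tags carry attributes; skip text, end tags, comments.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    String value = reader.getAttributeValue(null, attribute);
                    if (value != null) {
                        values.add(value);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Sample document shaped like the one in the question.
        String xml = "<page><student id=\"2406\"><name>A</name></student>"
                + "<student id=\"2407\"/></page>";
        System.out.println(extractAttribute(xml, "student", "id"));
    }
}
```

Because the parser streams through the document instead of building a tree, memory use stays small per page, which may help when many pages are processed. Unlike Jsoup, it throws an exception on malformed input rather than trying to repair it.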

java

xml

web-crawler