How to set depth of simple JAVA web crawler

I wrote a simple recursive web crawler that recursively fetches URL links from web pages.

Now I want a way to limit the crawler by depth, but I am not sure how to restrict it to a specific depth (I can limit the crawler to the first N links, but I want to limit it by depth).

For example: depth 2 should fetch the parent's links -> the children's links -> the children's children's links.

Any input is welcome.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class SimpleCrawler {

        // URLs that have already been processed
        static Map<String, String> retMap = new ConcurrentHashMap<String, String>();

        public static void main(String args[]) throws IOException {
            StringBuffer sb = new StringBuffer();
            Map<String, String> map = returnURL("http://www.google.com");
            recursiveCrawl(map);
            for (Map.Entry<String, String> entry : retMap.entrySet()) {
                sb.append(entry.getKey());
            }
        }

        public static void recursiveCrawl(Map<String, String> map)
                throws IOException {
            for (Map.Entry<String, String> entry : map.entrySet()) {
                String key = entry.getKey();
                Map<String, String> recurSive = returnURL(key);
                recursiveCrawl(recurSive);
            }
        }

        public synchronized static Map<String, String> returnURL(String URL)
                throws IOException {

            Map<String, String> tempMap = new HashMap<String, String>();
            Document doc = null;
            if (URL != null && !URL.equals("") && !retMap.containsKey(URL)) {
                System.out.println("Processing==>" + URL);
                try {
                    System.setProperty("http.proxyHost", "proxy");
                    System.setProperty("http.proxyPort", "port");
                    doc = Jsoup.connect(URL).get();
                    if (doc != null) {
                        Elements links = doc.select("a[href]");
                        for (Element e : links) {
                            // resolve relative hrefs to absolute URLs
                            String finalString = e.absUrl("href");
                            if (!retMap.containsKey(finalString)) {
                                tempMap.put(finalString, finalString);
                            }
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
                retMap.put(URL, URL);
            } else {
                System.out.println("****Skipping URL****" + URL);
            }
            return tempMap;
        }

    }

Edit 1:

I thought of using a worklist, so I modified the code. I am still not sure how to set the depth here (I can set the number of web pages to crawl, but not the depth). Any suggestions would be appreciated.

public void startCrawl(String url) {
    while (this.pagesVisited.size() < 2) {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.crawl(currentUrl);
        System.out.println("pagesToVisit size: " + pagesToVisit.size());
        // add the links found by the SpiderLeg to the worklist
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size()
            + " web page(s)");
}

You can do it like this:

static int maxLevels = 10;

public static void main(String args[]) throws IOException {
     ...
     recursiveCrawl(map,0);
     ...
}

public static void recursiveCrawl(Map<String, String> map, int level) throws IOException {
    for (Map.Entry<String, String> entry : map.entrySet()) {
        String key = entry.getKey();
        Map<String, String> recurSive = returnURL(key);
        if (level < maxLevels) {
            // pass level + 1 rather than ++level, so every sibling recurses with the same depth
            recursiveCrawl(recurSive, level + 1);
        }
    }
}

Also, you could use a Set instead of a Map, as sketched below.
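
For instance, a minimal sketch of that Set-based bookkeeping; returnURLs here is a hypothetical variant of your returnURL that returns the discovered links as a Set<String> (its Jsoup fetching is elided), and maxLevels caps the recursion depth:

import java.io.IOException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SetBasedCrawler {

    static int maxLevels = 10;
    // URLs already processed; the Map values were never used, so a Set is enough
    static Set<String> visited = Collections.synchronizedSet(new HashSet<String>());

    public static void main(String[] args) throws IOException {
        recursiveCrawl(returnURLs("http://www.google.com"), 0);
    }

    public static void recursiveCrawl(Set<String> links, int level) throws IOException {
        if (level >= maxLevels) {
            return; // stop descending once the desired depth is reached
        }
        for (String link : links) {
            Set<String> next = returnURLs(link); // hypothetical Set-returning variant of returnURL
            recursiveCrawl(next, level + 1);
        }
    }

    // Stand-in for the original returnURL, rewritten around the visited Set
    static Set<String> returnURLs(String url) throws IOException {
        Set<String> found = new HashSet<String>();
        // Set.add returns false if the URL was already present, replacing the containsKey checks
        if (url != null && !url.isEmpty() && visited.add(url)) {
            // ... fetch the page with Jsoup here and add each link's absolute URL to 'found'
        }
        return found;
    }
}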

You can add a depth parameter to the recursive method's signature, for example:

In your main:

recursiveCrawl(map,0);

public static void recursiveCrawl(Map<String, String> map, int depth) throws IOException {
    if (depth++ < DESIRED_DEPTH) { // assuming initial depth = 0
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive, depth);
        }
    }
}

Based on a non-recursive approach:

Keep a worklist of URLs to crawl, pagesToCrawl, whose entries are of type CrawlURL:

class CrawlURL {
  public String url;
  public int depth;

  public CrawlURL(String url, int depth) {
    this.url = url;
    this.depth = depth;
  }
}

Initially (before entering the loop):

Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0)); //rootUrl is the url to start from

Now the loop:

while (!pagesToCrawl.isEmpty()) { // will proceed at least once (for rootUrl)
  CrawlURL currentUrl = pagesToCrawl.remove();
  // analyze the url (e.g. with your SpiderLeg)
  // update the worklist with the crawled links (see below)
}

And update it with the links:

if (currentUrl.depth < 2) {
  for (String url : leg.getLinks()) { // referring to your analysis result
    pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
  }
}

You can enrich CrawlURL with additional metadata (e.g. the link name, the referrer, etc.), as sketched below.
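
For instance, here is a self-contained sketch that puts the pieces above together. It uses Jsoup directly for the analysis step (instead of your SpiderLeg class), and the start URL, the MAX_DEPTH constant and the extra referrer field are just illustrative assumptions:

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class CrawlURL {
    public String url;
    public int depth;
    public String referrer; // example of extra metadata: the page this link was found on

    public CrawlURL(String url, int depth, String referrer) {
        this.url = url;
        this.depth = depth;
        this.referrer = referrer;
    }
}

public class WorklistCrawler {

    private static final int MAX_DEPTH = 2; // illustrative depth limit

    public static void main(String[] args) {
        Queue<CrawlURL> pagesToCrawl = new LinkedList<CrawlURL>();
        Set<String> visited = new HashSet<String>();
        pagesToCrawl.add(new CrawlURL("http://www.google.com", 0, null)); // root has no referrer

        while (!pagesToCrawl.isEmpty()) {
            CrawlURL current = pagesToCrawl.remove();
            if (!visited.add(current.url)) {
                continue; // already processed
            }
            try {
                Document doc = Jsoup.connect(current.url).get(); // "analyze the url" step
                // enqueue the children only while we are above the depth limit
                if (current.depth < MAX_DEPTH) {
                    for (Element link : doc.select("a[href]")) {
                        pagesToCrawl.add(new CrawlURL(link.absUrl("href"),
                                current.depth + 1, current.url));
                    }
                }
            } catch (IOException e) {
                System.out.println("Failed to fetch " + current.url);
            }
        }
        System.out.println("Visited " + visited.size() + " page(s)");
    }
}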

Alternative: in my comment above I mentioned a generation-based approach. It is a bit more complex than this one. The basic idea is to keep two lists (currentPagesToCrawl and futurePagesToCrawl) plus a generation variable (starting at 0 and incremented every time currentPagesToCrawl becomes empty). All crawled URLs are put into the futurePagesToCrawl queue, and when currentPagesToCrawl runs empty the two lists are swapped. This continues until the generation variable reaches 2.
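
A minimal sketch of that generation-based idea; the Jsoup-backed fetchLinks helper and the hard-coded start URL are just placeholders for your own analysis code:

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class GenerationCrawler {

    public static void main(String[] args) {
        Queue<String> currentPagesToCrawl = new LinkedList<String>();
        Queue<String> futurePagesToCrawl = new LinkedList<String>();
        Set<String> visited = new HashSet<String>();
        int generation = 0; // starts at 0, incremented whenever the current list runs empty

        currentPagesToCrawl.add("http://www.google.com"); // placeholder start URL

        while (generation < 2 && !currentPagesToCrawl.isEmpty()) {
            String url = currentPagesToCrawl.remove();
            if (visited.add(url)) {
                // links discovered on this page belong to the next generation
                futurePagesToCrawl.addAll(fetchLinks(url));
            }
            if (currentPagesToCrawl.isEmpty()) {
                // swap the two lists and advance the generation counter
                Queue<String> tmp = currentPagesToCrawl;
                currentPagesToCrawl = futurePagesToCrawl;
                futurePagesToCrawl = tmp;
                generation++;
            }
        }
        System.out.println("Visited " + visited.size() + " page(s) in " + generation + " generation(s)");
    }

    // Hypothetical helper: fetch a page with Jsoup and return the absolute URLs of its links
    static Set<String> fetchLinks(String url) {
        Set<String> links = new HashSet<String>();
        try {
            for (Element a : Jsoup.connect(url).get().select("a[href]")) {
                links.add(a.absUrl("href"));
            }
        } catch (IOException e) {
            System.out.println("Failed to fetch " + url);
        }
        return links;
    }
}

Swapping the two queues when the current one runs empty is what advances the generation counter, so this sketch stops after the root page and its direct children have been processed.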