Prioritizing recursive crawl in Storm Crawler

web-crawler
nutch
stormcrawler

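When crawling the World Wide Web, I want to give my crawler an initial seed list of URLs - and I expect the crawler to automatically 'discover' new seed URLs from the Internet during its crawl.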

I see such an option in Apache Nutch (see the topN parameter in the generate command of nutch). Is there any such option in Storm Crawler as well?

StormCrawler can handle recursive crawls, and the way URLs are prioritized depends on the backend used to store them.

For instance, the Elasticsearch module can be used for that; see the README for a short tutorial and the sample config file. By default the spouts sort URLs based on their nextFetchDate (see the *.sort.field settings).
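
To make the ordering idea concrete, here is a minimal sketch in plain Python of a frontier that hands out URLs by earliest nextFetchDate and accepts newly discovered links while the crawl is running. It is not StormCrawler or Elasticsearch code: the `Frontier` class, its `add` and `next_due` methods, and the example URLs are all hypothetical names chosen for illustration.

```python
import heapq
from datetime import datetime, timedelta


class Frontier:
    """Toy stand-in for a URL status backend (hypothetical, not StormCrawler's API)."""

    def __init__(self):
        self._heap = []      # min-heap of (next_fetch_date, url)
        self._known = set()  # URLs already seen, so rediscovered links are ignored

    def add(self, url, next_fetch_date):
        """Register a seed or discovered URL; return False if it was already known."""
        if url in self._known:
            return False
        self._known.add(url)
        heapq.heappush(self._heap, (next_fetch_date, url))
        return True

    def next_due(self, now):
        """Pop the URL with the earliest next_fetch_date if it is due, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None


now = datetime(2024, 1, 1, 12, 0, 0)
frontier = Frontier()
frontier.add("https://example.com/", now)  # seed
frontier.add("https://example.com/later", now + timedelta(hours=1))
frontier.add("https://example.com/earlier", now - timedelta(minutes=5))

fetched = []
url = frontier.next_due(now)
while url is not None:
    fetched.append(url)
    if url == "https://example.com/":
        # Pretend the seed page linked to a new URL: it joins the same queue.
        frontier.add("https://example.com/discovered", now)
    url = frontier.next_due(now)
```

In this sketch the URL scheduled for later is left in the queue, and the link discovered mid-crawl is picked up without restarting anything, which is the behaviour the answer describes for a backend sorted on nextFetchDate.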

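In Nutch, the -topN argument only specifies the maximum number of URLs to put in the next segment (based on the scores provided by whichever scoring plugin is used). With StormCrawler we don't really need an equivalent, because things are not processed in batches: the crawl runs continuously.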
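
For contrast, the batch-style selection that -topN implies can be sketched roughly as follows. This is a conceptual illustration only, not Nutch's implementation; `select_segment`, the score values, and the URLs are made up for the example.

```python
import heapq


def select_segment(scored_urls, top_n):
    """Return at most top_n URLs with the highest scores, best first."""
    best = heapq.nlargest(top_n, scored_urls.items(), key=lambda item: item[1])
    return [url for url, _score in best]


# Hypothetical scores, as a scoring plugin might assign them.
scored = {
    "https://example.com/a": 0.2,
    "https://example.com/b": 0.9,
    "https://example.com/c": 0.5,
}
segment = select_segment(scored, 2)
```

Each batch cycle would repeat this cut-off step, whereas a continuously running crawl simply keeps pulling the next due URL from its backend.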
