Stormcrawler

Question

I am using StormCrawler to put data into some Elasticsearch indexes, and I have a bunch of URLs in the status index with various statuses: DISCOVERED, FETCHED, ERROR, etc.

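I was wondering whether I can tell StormCrawler to crawl only the URLs that are https and have the status DISCOVERED, and whether that would actually work. I have my es-conf.yaml set as follows: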

es.status.filterQuery: "-(url:https* AND status:DISCOVERED)"

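Is that correct? How does SC make use of es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?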

Answer 1

See the code of the AggregationSpout.

how does SC make use of the es.status.filterQuery? Does it run a search and apply the value as a filter to retrieve only the applicable documents to fetch?

Yes, it filters the queries sent to the ES shards. This is useful for processing a subset of a crawl.

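It is a positive filter, i.e. a document must match the query in order to be retrieved; you need to remove the leading - to do what you described.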
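
As a sketch of the corrected setting, this is the line from the question with only the leading - removed. It assumes that the value is interpreted with Elasticsearch query-string syntax and that the status index exposes fields named url and status, as the question implies; check both against your own mapping.

```yaml
# es-conf.yaml (fragment)
# Positive filter: only status documents that match this query are retrieved.
# Assumption: the fields "url" and "status" exist in the status index.
es.status.filterQuery: "url:https* AND status:DISCOVERED"
```

With the - in place, the same expression would exclude exactly the documents you want and keep everything else.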

Stormcrawler - how does the es.status.filterQuery work?

web-crawler

elasticsearch