Clarification on how StormCrawler's default-regex-filters.txt works

Question

With StormCrawler, I added -^(http|https):\/\/example.com\/page\/?date to default-regex-filters.txt, but I still see

2019-03-20 08:49:58.110 c.d.s.b.JSoupParserBolt Thread-5-parse-executor[7 7] [INFO] Parsing : starting https://example.com/page/?date=1999-9-16&t=list
2019-03-20 08:49:58.117 c.d.s.b.JSoupParserBolt Thread-5-parse-executor[7 7] [INFO] Parsed https://example.com/page/?date=1999-9-16&t=list in 6 msec

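in the logs, even though no documents show up in the index. Is StormCrawler skipping the URL, is it still fetching it, or is it just retrieving the URL from the status table and then evaluating it?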

Answer 1

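The filtering is applied to the outlinks after parsing, and the 'surviving' URLs are sent to the status updater bolt. It affects the discovery of URLs only; in other words, if a URL is emitted by the spout, it will be processed regardless of the filters.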
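
To make the outlink filtering step concrete, here is a minimal Java sketch of a first-match-wins regex filter applied to outlinks found on a parsed page. It is not StormCrawler's actual class: RegexFilterSketch, its filter method, the /about outlink and the trailing "+." accept-all rule are hypothetical names and values chosen for illustration, and the sketch assumes rules are tested with Matcher.find() and that a URL matching no rule is rejected. It also compares the rule as written in the question with a variant in which the dot and the question mark are escaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Simplified stand-in for a first-match-wins regex URL filter.
// Hypothetical sketch, not StormCrawler's real implementation.
class RegexFilterSketch {

    // One rule line: '+' (accept) or '-' (reject) followed by a regex.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexFilterSketch(String... lines) {
        for (String line : lines) {
            rules.add(new Rule(line));
        }
    }

    // Returns the URL if it survives, or null if it is filtered out.
    String filter(String url) {
        for (Rule rule : rules) {
            // Assumption: a rule applies if the regex is found anywhere in the URL.
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null; // assumption: no matching rule means the URL is dropped
    }

    public static void main(String[] args) {
        // Outlinks discovered while parsing some page (illustrative values).
        String[] outlinks = {
            "https://example.com/page/?date=1999-9-16&t=list",
            "https://example.com/about"
        };

        // Rule exactly as written in the question, then accept everything else.
        RegexFilterSketch asWritten = new RegexFilterSketch(
            "-^(http|https):\\/\\/example.com\\/page\\/?date", "+.");
        // Variant with the '.' and the '?' escaped so they match literally.
        RegexFilterSketch escaped = new RegexFilterSketch(
            "-^(http|https):\\/\\/example\\.com\\/page\\/\\?date", "+.");

        for (String url : outlinks) {
            System.out.println(url
                + " | as written: " + (asWritten.filter(url) != null ? "kept" : "dropped")
                + " | escaped: " + (escaped.filter(url) != null ? "kept" : "dropped"));
        }
    }
}
```

Under Java regex semantics, the unescaped ? in the rule as written makes the preceding slash optional instead of matching a literal question mark, so that rule does not match https://example.com/page/?date=1999-9-16&t=list and the outlink is kept. The escaped variant drops it. Either way, this step only decides which newly discovered outlinks reach the status updater bolt; it does not stop the spout from emitting a URL that is already stored.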

web-crawler

stormcrawler