How to set a rule according to the current URL?

Question

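I'm using Scrapy and I want more control over the crawler. To get it, I would like to set rules depending on the URL I am currently processing.

For example, if I am on example.com/a, I want to apply a rule with LinkExtractor(restrict_xpaths='//div[@class="1"]'). If I am on example.com/b, I want to use another rule with a different link extractor.

How do I accomplish this?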

Answer 1

I would just code them explicitly in separate callbacks, instead of relying on the CrawlSpider rules.

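import scrapy
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Fallback extractor for pages with no special handling
    extractor = LinkExtractor()  # .. some default arguments ..

    # Pages under example.com/a get a more restrictive extractor
    if 'example.com/a' in response.url:
        extractor = LinkExtractor(restrict_xpaths='//div[@class="1"]')

    # Follow every extracted link with the callback of your choice
    for link in extractor.extract_links(response):
        yield scrapy.Request(link.url, callback=self.whatever)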

This is much better than trying to change the rules at runtime, because the rules are supposed to be the same for all callbacks.
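One caveat about the substring test in the callback above: 'example.com/a' in response.url is also true for URLs such as example.com/about, or for any URL that merely contains that text in its query string. If that matters for your site, a stricter check could look roughly like the sketch below. The table name XPATHS_BY_PREFIX, the helper xpath_for, and the XPath for /b are hypothetical and only there for illustration; only the /a XPath comes from the question.

```python
from urllib.parse import urlparse

# Hypothetical table: (host, path prefix) -> XPath to pass as restrict_xpaths.
# Only the '/a' entry comes from the question; the '/b' XPath is made up.
XPATHS_BY_PREFIX = {
    ("example.com", "/a"): '//div[@class="1"]',
    ("example.com", "/b"): '//div[@class="2"]',
}


def xpath_for(url):
    """Return the XPath configured for this URL, or None if nothing matches."""
    parts = urlparse(url)
    for (host, prefix), xpath in XPATHS_BY_PREFIX.items():
        # Accept the exact path or anything below it, so '/about' is not
        # mistaken for '/a'.
        under_prefix = parts.path == prefix or parts.path.startswith(prefix + "/")
        if parts.hostname == host and under_prefix:
            return xpath
    return None
```

In the callback you would then call xpath_for(response.url) and build the restricted LinkExtractor only when it returns something other than None, falling back to your default extractor otherwise. Note that this sketch compares the host exactly, so www.example.com would need its own entry.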

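In the parse callback above I've only used link extractors, but if you want to use different rules you can do much the same thing, mirroring the code that handles rules in the loop found in CrawlSpider._requests_to_follow.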

python

scrapy