Scrapy not executing all rules for CrawlSpider

I have the following scraper:

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SpiderOpUpcoming(CrawlSpider):
    
    name = "upcoming"
    start_urls = ["https://www.oddsportal.com/tennis/"]
    custom_settings = {"USER_AGENT": "*"}
    
    tournament_linkxtr = LinkExtractor(
        allow="/tennis/",
        restrict_xpaths=(
            "//table[@id='sport_content_tennis' and @class='table-main sportcount']"
            "//a[@foo='f']"
        ),
    )
    match_linkxtr = LinkExtractor(
        allow="/tennis/",
        restrict_xpaths=("//td[@class='name table-participant']//a"),
    )
    
    rules = (
        Rule(tournament_linkxtr, callback="parse_tournament", follow=True),
        Rule(match_linkxtr, callback="parse_match", follow=True),
    )
    handle_httpstatus_list = [301]

    def parse_tournament(self, response):
        print("TOURNAMENT -", response.url)

    def parse_match(self, response):
        print("MATCH -", response.url)


process = CrawlerProcess()
process.crawl(SpiderOpUpcoming)
process.start()

The prints from parse_tournament show up as expected, but nothing is ever printed from parse_match.

To troubleshoot, I changed start_urls to the URL of a 'tournament' page, one that the spider above reaches through tournament_linkxtr. I then removed the Rule whose callback is parse_tournament. See below:

start_urls = ["https://www.oddsportal.com/tennis/usa/champaign-challenger-men"]
rules = (Rule(match_linkxtr, callback="parse_match", follow=True),)

With that change the spider does print from parse_match, so the XPath is fine. I also don't think the order of the rules is the problem, so I'm confused.

Can anyone point out where I'm going wrong?

Scrapy version is 2.4.1, running on macOS Monterey on a 2016 MacBook Pro.

Changing:

handle_httpstatus_list = [301]

To:

httperror_allowed_codes = [301]

sorted this out for me. See the documentation here.
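
For anyone wondering why handle_httpstatus_list = [301] would get in the way: as far as I understand it, a redirect status listed there is handed to the callback as-is instead of being followed. So if a tournament or match link answers with a 301, the link extractors only ever see the 301 response and never the page behind it. Below is a rough, simplified model of that decision. It is not Scrapy's actual code, and the helper name and the set of redirect statuses are my own assumptions for illustration.

```python
# Simplified model only; is_redirect_followed is a hypothetical helper,
# not part of Scrapy's API.

# Assumption: the statuses that are normally treated as redirects.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def is_redirect_followed(status, handle_httpstatus_list=()):
    """Return True if a response with this status would be followed as a redirect."""
    if status not in REDIRECT_STATUSES:
        # Not a redirect at all, so there is nothing to follow.
        return False
    # A status the spider asked to handle itself is passed through to the
    # callback unchanged, which means the redirect is not followed.
    return status not in handle_httpstatus_list


# With the original spider attribute, a 301 is not followed.
print(is_redirect_followed(301, handle_httpstatus_list=[301]))
# Without it, the 301 is followed to the target page.
print(is_redirect_followed(301))
```

Under that model, dropping 301 from handle_httpstatus_list lets the redirects be followed again, which would match what I saw after the change.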