Scrapy - Filtering offsite request but in allowed domains?

I have the following code and want to step through the site page by page:

import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['https://www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            yield {
                "link": response.urljoin(elem.xpath(".//a/@href").get())
            }

        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage is not None:
            nextPage = response.urljoin(nextPage)
            yield scrapy.Request(nextPage, callback=self.parse)

But when I run this code, only the first page is scraped and I get this error message:

2021-11-17 12:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.tripadvisor.co.uk': <GET https://www.tripadvisor.co.uk/ClientLink?value=NVB5Xy9BdHRyYWN0aW9ucy1nMTg2MjE2LUFjdGl2aXRpZXMtYzQ4LWFfYWxsQXR0cmFjdGlvbnMudHJ1ZS1vYTMwLVVuaXRlZF9LaW5nZG9tLmh0bWxfQ3Yx>

I only get all the results when I remove this line:

allowed_domains = ['https://www.tripadvisor.co.uk']

Why? The link to the next page is within the allowed domain.

allowed_domains is not mandatory in a spider; to minimise errors like this, the simplest option is to leave it out entirely. The other option is to keep allowed_domains but drop the https:// scheme: according to the Scrapy documentation, allowed_domains should contain only domain names, e.g. www.tripadvisor.co.uk. The https:// prefix is exactly why the offsite middleware filters your pagination requests.

So the correct setting is:

allowed_domains = ['www.tripadvisor.co.uk']
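
For completeness, here is a minimal sketch of the spider with the corrected allowed_domains. It reuses the selectors and URLs from the question; only the domain entry changes:

import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    # Domain only - no scheme, otherwise the offsite middleware never matches it
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        # Collect the attraction links on the current page
        for elem in response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']"):
            yield {"link": response.urljoin(elem.xpath(".//a/@href").get())}

        # Follow the pagination link, which is now accepted instead of being filtered as offsite
        next_page = response.xpath("//a[@aria-label='Next page']/@href").get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

With this change the "Filtered offsite request" messages disappear and the spider keeps following the Next page link until there are no more pages.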