Scrapy - Filtering offsite request but in allowed domains?
I have the following code and want to step through the site page by page:
import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['https://www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            yield {
                "link": response.urljoin(elem.xpath(".//a/@href").get())
            }

        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage != None:
            nextPage = response.urljoin(nextPage)
            yield scrapy.Request(nextPage, callback=self.parse)
But when I run this code, only the first page is scraped and I get this error message:
2021-11-17 12:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.tripadvisor.co.uk': <GET https://www.tripadvisor.co.uk/ClientLink?value=NVB5Xy9BdHRyYWN0aW9ucy1nMTg2MjE2LUFjdGl2aXRpZXMtYzQ4LWFfYWxsQXR0cmFjdGlvbnMudHJ1ZS1vYTMwLVVuaXRlZF9LaW5nZG9tLmh0bWxfQ3Yx>
Only when I remove this line do I get all the results:
allowed_domains = ['https://www.tripadvisor.co.uk']
Why? The link to the following page is on the allowed domain, isn't it?
allowed_domains is not mandatory in a spider; to minimise this kind of error it is often easiest to leave it out entirely. If you do keep allowed_domains, you must not include the https:// scheme: according to the Scrapy documentation the entries should be bare domain names, so www.tripadvisor.co.uk is what belongs there. The offsite filter compares only the hostname of each request against those entries, which is why the https:// part causes the error you are seeing.
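A quick way to see what goes wrong (a minimal sketch, not Scrapy's actual middleware code): the offsite check works on the request's hostname, and a hostname never contains the scheme, so an allowed_domains entry that still starts with https:// can never match.

from urllib.parse import urlsplit

# Hostname extracted from the filtered request in the log above
hostname = urlsplit("https://www.tripadvisor.co.uk/ClientLink?value=...").hostname
print(hostname)  # -> 'www.tripadvisor.co.uk'

# An entry that still contains the scheme can never equal that hostname,
# so the request is treated as offsite and dropped:
print(hostname == "https://www.tripadvisor.co.uk")  # False
print(hostname == "www.tripadvisor.co.uk")          # True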
The correct way is as follows:
allowed_domains = ['www.tripadvisor.co.uk']
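Putting it together, the spider from the question should then follow the "Next page" links with only that one line changed (a sketch of the corrected spider, otherwise the same code as above):

import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    # Domain only, no scheme, so the offsite middleware keeps the requests
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        # Yield one item per attraction card on the page
        for elem in response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']"):
            yield {"link": response.urljoin(elem.xpath(".//a/@href").get())}

        # Follow the pagination link if there is one
        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage is not None:
            yield scrapy.Request(response.urljoin(nextPage), callback=self.parse)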