Scrapy - Filtering offsite request but in allowed domains?
I have the following code and want to step through the site page by page:
import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    allowed_domains = ['https://www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        tmpSEC = response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']")
        for elem in tmpSEC:
            yield {
                "link": response.urljoin(elem.xpath(".//a/@href").get())
            }

        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage != None:
            nextPage = response.urljoin(nextPage)
            yield scrapy.Request(nextPage, callback=self.parse)
But when I run this code, only the first page is scraped and I get this error message:
2021-11-17 12:52:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.tripadvisor.co.uk': <GET https://www.tripadvisor.co.uk/ClientLink?value=NVB5Xy9BdHRyYWN0aW9ucy1nMTg2MjE2LUFjdGl2aXRpZXMtYzQ4LWFfYWxsQXR0cmFjdGlvbnMudHJ1ZS1vYTMwLVVuaXRlZF9LaW5nZG9tLmh0bWxfQ3Yx>
Only when I remove this line do I get all the results:
allowed_domains = ['https://www.tripadvisor.co.uk']
Why? The link to the following page is on the allowed domain, isn't it?
allowed_domains is not mandatory in a spider; to minimise this kind of error it is often easiest to leave it out entirely. If you do keep allowed_domains, you must not include the https:// scheme: according to the Scrapy documentation the entries should be bare domain names, so www.tripadvisor.co.uk is what belongs there. The offsite filter compares only the hostname of each request against those entries, which is why the https:// part causes the error you are seeing.
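A quick way to see what goes wrong (a minimal sketch, not Scrapy's actual middleware code): the offsite check works on the request's hostname, and a hostname never contains the scheme, so an allowed_domains entry that still starts with https:// can never match.

from urllib.parse import urlsplit

# Hostname extracted from the filtered request in the log above
hostname = urlsplit("https://www.tripadvisor.co.uk/ClientLink?value=...").hostname
print(hostname)  # -> 'www.tripadvisor.co.uk'

# An entry that still contains the scheme can never equal that hostname,
# so the request is treated as offsite and dropped:
print(hostname == "https://www.tripadvisor.co.uk")  # False
print(hostname == "www.tripadvisor.co.uk")          # True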
The correct way is as follows:
allowed_domains = ['www.tripadvisor.co.uk']
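Putting it together, the spider from the question should then follow the "Next page" links with only that one line changed (a sketch of the corrected spider, otherwise the same code as above):

import scrapy

class ZoosSpider(scrapy.Spider):
    name = 'zoos'
    # Domain only, no scheme, so the offsite middleware keeps the requests
    allowed_domains = ['www.tripadvisor.co.uk']
    start_urls = ['https://www.tripadvisor.co.uk/Attractions-g186216-Activities-c48-a_allAttractions.true-United_Kingdom.html']

    def parse(self, response):
        # Yield one item per attraction card on the page
        for elem in response.xpath("//section[@data-automation='AppPresentation_SingleFlexCardSection']"):
            yield {"link": response.urljoin(elem.xpath(".//a/@href").get())}

        # Follow the pagination link if there is one
        nextPage = response.xpath("//a[@aria-label='Next page']/@href").get()
        if nextPage is not None:
            yield scrapy.Request(response.urljoin(nextPage), callback=self.parse)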