Why is Scrapy not crawling my link for extraction?

Question

I am trying to use a simple spider to follow this link, https://tonaton.com/en/ads/ghana/electronics, but Scrapy does not crawl it. I can't tell what is wrong with my code. Any help? My code is below.

import scrapy


class BusinessesSpider(scrapy.Spider):
    name = 'businesses'
    allowed_domains = ['www.tonaton.com']
    start_urls = ['https://www.tonaton.com/']

    def parse(self, response):
        businesses = response.xpath("//a[@class='link--1t8hM gtm-home-category-link-click']")
        for business in businesses:
            link = business.xpath(".//@href").get()
            category = business.xpath(".//div[2]/p/text()").get()

            yield response.follow(url=link, callback=self.parse_business, meta={'business_category': category})

    def parse_business(self, response):
        category = response.request.meta['business_category']

But this is what the terminal returns:

2021-08-27 11:21:56 [scrapy.core.engine] INFO: Spider opened
2021-08-27 11:21:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-27 11:21:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-27 11:21:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://tonaton.com/> from <GET https://www.tonaton.com/>
2021-08-27 11:21:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://tonaton.com/> from <GET http://tonaton.com/>
2021-08-27 11:21:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tonaton.com/> (referer: None)
2021-08-27 11:21:59 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tonaton.com': <GET https://tonaton.com/en/ads/ghana/electronics>
2021-08-27 11:21:59 [scrapy.core.engine] INFO: Closing spider (finished)

Answer 1

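You need to change allowed_domains. The site redirects www.tonaton.com to tonaton.com, and the offsite middleware only lets through the listed domains and their subdomains. tonaton.com is not a subdomain of www.tonaton.com, so the followed request is dropped (the "Filtered offsite request to 'tonaton.com'" line in your log). List the bare domain instead, which also covers www.tonaton.com: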

allowed_domains = ['tonaton.com']
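
To see why the bare domain works and the www one does not, here is a minimal sketch of the matching rule, assuming the check is "the request host equals an allowed domain or is a subdomain of one". The helper name is_allowed is hypothetical and this is a simplified approximation for illustration, not Scrapy's actual implementation:

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper approximating an offsite check.

    A host passes if it equals an allowed domain or ends with
    "." + that domain (i.e. it is a subdomain of it).
    """
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


followed = "https://tonaton.com/en/ads/ghana/electronics"

# The original setting: the bare host is not covered by the www entry.
print(is_allowed(followed, ["www.tonaton.com"]))  # False

# The corrected setting: the bare domain covers itself and www.
print(is_allowed(followed, ["tonaton.com"]))  # True
print(is_allowed("https://www.tonaton.com/", ["tonaton.com"]))  # True
```

The start URL itself is still fetched under the original setting, which is why the log shows one crawled page before the followed link is filtered.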

python

scrapy

web-scraping