Scrapy follows link but does not return data, possible timing issue?
I have tried several settings, such as delaying the download, and there appear to be no errors in the console; the selectors return the correct data from the Scrapy shell.
The site uses a different prefix in the domain (slist.amiami.jp); could that be the cause? I have tried several variations of the domain and URLs, but they all result in the same response with no data returned.
Any idea why it is not collecting any data for the -o CSV file? Any suggestions appreciated.
The expected output is the JAN code and the category text returned from the product pages.
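For reference, the download-delay experiments mentioned above were along these lines (a minimal sketch of settings.py; DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings, and the values here are arbitrary examples, not recommendations):

# settings.py (sketch; values are arbitrary examples)
DOWNLOAD_DELAY = 3               # wait 3 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter each delay between 0.5x and 1.5x of DOWNLOAD_DELAY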
2021-05-13 23:59:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-13 23:59:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6026
2021-05-13 23:59:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=4967834601246> (referer: None)
2021-05-13 23:59:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.jp/top/detail/detail?gcode=TOY-SCL-05454> (referer: https://example.jp/top/search/list?s_keywords=4967834601246)
2021-05-13 23:59:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=4543736302216> (referer: None)
2021-05-14 00:00:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=44536318620013> (referer: None)
2021-05-14 00:00:04 [scrapy.core.engine] INFO: Closing spider (finished)
2021-05-14 00:00:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1115,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'elapsed_time_seconds': 29.128242,
'finish_reason': 'finished',
import scrapy


class exampledataSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.jp']
    start_urls = [
        'https://example.jp/top/search/list?s_keywords=4967834601246',
        'https://example.jp/top/search/list?s_keywords=4543736302216',
        'https://example.jp/top/search/list?s_keywords=44536318620013',
    ]

    def parse(self, response):
        # Follow every product link found on the search results page.
        for link in response.css('div.product_box a::attr(href)'):
            yield response.follow(link.get(), callback=self.item)

    def item(self, response):
        # Extract the JAN code and the breadcrumb category text from the detail page.
        products = response.css('div.maincontents')
        for product in products:
            yield {
                'JAN': product.css('dd.jancode::text').getall(),
                'title': product.css('div.pankuzu a::text').getall()
            }
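For reference, the selector checks I ran in the Scrapy shell looked roughly like this (a sketch; the detail URL is taken from the crawl log above, and the shell output is not reproduced here):

scrapy shell 'https://www.example.jp/top/detail/detail?gcode=TOY-SCL-05454'
>>> response.css('dd.jancode::text').getall()
>>> response.css('div.pankuzu a::text').getall()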
It seems the products = response.css('div.maincontents') selector was incorrect, and I had been making two separate parent/child requests for the data.
It turns out you can simply yield the elements as lists:
'''
def output(self, response):
    yield {
        'firstitem': response.css('example td:nth-of-type(2)::text').getall(),
        'seconditem': response.css('example td:nth-of-type(2)::text').getall(),
        'thirditem': response.css('example td:nth-of-type(2)::text').getall()
    }
'''
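Applied to the spider above, the fix amounts to dropping the div.maincontents wrapper loop and yielding the lists straight from the detail-page response (a sketch under that assumption; the selectors are unchanged from the original):

def item(self, response):
    # No parent-container loop; yield the extracted lists directly.
    yield {
        'JAN': response.css('dd.jancode::text').getall(),
        'title': response.css('div.pankuzu a::text').getall()
    }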