Scrapy follows link but does not return data, possible timing issue?
I have tried several settings, such as delaying the download, and there appear to be no errors in the console; the selectors return the correct data from the Scrapy shell.
The site uses a different prefix in the domain (slist.amiami.jp); could that be the cause? I have tried several variations of the domain and URLs, but they all result in the same response with no data returned.
Any idea why it is not collecting any data for the -o CSV file? Any suggestions appreciated.
The expected output is the JAN code and the category text returned from the product pages.
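For reference, the download-delay experiments mentioned above were along these lines (a minimal sketch of settings.py; DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings, and the values here are arbitrary examples, not recommendations):

# settings.py (sketch; values are arbitrary examples)
DOWNLOAD_DELAY = 3               # wait 3 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter each delay between 0.5x and 1.5x of DOWNLOAD_DELAY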
2021-05-13 23:59:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-05-13 23:59:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6026
2021-05-13 23:59:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=4967834601246> (referer: None)
2021-05-13 23:59:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.jp/top/detail/detail?gcode=TOY-SCL-05454> (referer: https://example.jp/top/search/list?s_keywords=4967834601246)
2021-05-13 23:59:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=4543736302216> (referer: None)
2021-05-14 00:00:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.jp/top/search/list?s_keywords=44536318620013> (referer: None)
2021-05-14 00:00:04 [scrapy.core.engine] INFO: Closing spider (finished)
2021-05-14 00:00:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1115,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'elapsed_time_seconds': 29.128242,
'finish_reason': 'finished',
import scrapy


class exampledataSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.jp']
    start_urls = [
        'https://example.jp/top/search/list?s_keywords=4967834601246',
        'https://example.jp/top/search/list?s_keywords=4543736302216',
        'https://example.jp/top/search/list?s_keywords=44536318620013',
    ]

    def parse(self, response):
        # Follow every product link found on the search results page.
        for link in response.css('div.product_box a::attr(href)'):
            yield response.follow(link.get(), callback=self.item)

    def item(self, response):
        # Extract the JAN code and the breadcrumb category text from the detail page.
        products = response.css('div.maincontents')
        for product in products:
            yield {
                'JAN': product.css('dd.jancode::text').getall(),
                'title': product.css('div.pankuzu a::text').getall()
            }
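For reference, the selector checks I ran in the Scrapy shell looked roughly like this (a sketch; the detail URL is taken from the crawl log above, and the shell output is not reproduced here):

scrapy shell 'https://www.example.jp/top/detail/detail?gcode=TOY-SCL-05454'
>>> response.css('dd.jancode::text').getall()
>>> response.css('div.pankuzu a::text').getall()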
It seems the products = response.css('div.maincontents') selector was incorrect, and I had been making two separate parent/child requests for the data.
It turns out you can simply yield the elements as lists:
'''
def output(self, response):
    yield {
        'firstitem': response.css('example td:nth-of-type(2)::text').getall(),
        'seconditem': response.css('example td:nth-of-type(2)::text').getall(),
        'thirditem': response.css('example td:nth-of-type(2)::text').getall()
    }
'''
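Applied to the spider above, the fix amounts to dropping the div.maincontents wrapper loop and yielding the lists straight from the detail-page response (a sketch under that assumption; the selectors are unchanged from the original):

def item(self, response):
    # No parent-container loop; yield the extracted lists directly.
    yield {
        'JAN': response.css('dd.jancode::text').getall(),
        'title': response.css('div.pankuzu a::text').getall()
    }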