Scrapy: TypeError: 'Request' object is not iterable
I'm using Scrapy (1.1.2) to build a spider that scrapes products. I got it working and collecting enough data, but now I'd like each item to issue a new request to its product page and scrape, for example, the product description.
First, here is my last working code.
spider.py (excerpt)
class ProductScrapSpider(Spider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/index.php?id_category=24"
        # ...
    ]

    def parse(self, response):
        for sel in response.xpath("a long string"):
            mainloader = ProductLoader(selector=sel)
            mainloader.add_value('category', 'Category Name')
            mainloader.add_value('meta', self.get_meta(sel))
            # more data
            yield mainloader.load_item()

        # Follows the pagination
        next_page = response.css("li#pagination_next a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def get_meta(self, response):
        metaloader = ProductMetaLoader(selector=response)
        metaloader.add_value('store', "Store name")
        # more data
        yield metaloader.load_item()
Output
[
    {
        "category": "Category Name",
        "price": 220000,
        "meta": {
            "baseURL": "",
            "name": "",
            "store": "Store Name"
        },
        "reference": "100XXX100"
    },
    ...
]
After reading the documentation and some answers here, I changed the get_meta method and added a callback, get_product_page, to the request:
new_spider.py (excerpt)
def get_meta(self, response):
    metaloader = ProductMetaLoader(selector=response)
    metaloader.add_value('store', "Store name")
    # more data
    items = metaloader.load_item()
    new_request = scrapy.Request(items['url'], callback=self.get_product_page)
    # Passing the metadata
    new_request.meta['item'] = items
    # The source of the problem
    yield new_request

def get_product_page(self, response):
    sel = response.selector.css('.product_description')
    items = response.meta['item']
    new_meta = items
    new_meta.update({'product_page': sel[0].extract()})
    return new_meta
Expected output
[
    {
        "category": "Category Name",
        "price": 220000,
        "meta": {
            "baseURL": "",
            "name": "",
            "store": "Store Name",
            "product_page": "<div> [...] </div>"
        },
        "reference": "100XXX100"
    },
    ...
]
Error
TypeError: 'Request' object is not iterable
I couldn't find any information about this error, so please help me fix it.
Thanks a lot.
The error you are getting (TypeError: 'Request' object is not iterable) occurs because a Request instance ends up inside one of the item's fields (via the updated get_meta method), and the feed exporter cannot serialize it.
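Concretely, here is a rough sketch of how that happens in the question's code (the exact traceback path may differ):

    # In new_spider.py, parse() still contains this line:
    mainloader.add_value('meta', self.get_meta(sel))
    # Because get_meta() now uses 'yield new_request', calling it returns a
    # generator object rather than an item. The item loader flattens that
    # generator, so the scrapy.Request it yields gets stored in item['meta']
    # instead of being scheduled, and Scrapy then fails when processing the item.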
Instead, you need to return the metadata request to Scrapy itself, using its meta argument to carry the half-parsed item along. Here is an example of an updated parse method and a new parse_get_meta method:
def parse(self, response):
    for sel in response.xpath("a long string"):
        mainloader = ProductLoader(selector=sel)
        mainloader.add_value('category', 'Category Name')
        # mainloader.add_value('meta', self.get_meta(sel))
        # more data
        item = mainloader.load_item()

        # get_meta() is assumed here to return (not yield) the scrapy.Request it builds
        get_meta_req = self.get_meta(sel)
        # Attach the half-parsed item to the request's meta dict
        get_meta_req.meta['item'] = item
        yield get_meta_req.replace(callback=self.parse_get_meta)

def parse_get_meta(self, response):
    """Parses a get_meta response"""
    item = response.meta['item']
    # Parse the response and load the data here, e.g. item['foo'] = bar
    # Finally return the item
    return item
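For completeness, here is a minimal sketch of how get_meta and parse_get_meta could fit together under this approach. It assumes get_meta is rewritten to return the request instead of yielding it, that ProductMetaLoader and the 'url' field come from the question's code, and that '.product_description' selects the description:

    def get_meta(self, sel):
        metaloader = ProductMetaLoader(selector=sel)
        metaloader.add_value('store', "Store name")
        # more data
        meta_item = metaloader.load_item()
        # Return (do not yield) the request so parse() can attach the
        # half-parsed item and the final callback to it.
        return scrapy.Request(meta_item['url'], meta={'meta_item': meta_item})

    def parse_get_meta(self, response):
        """Completes the half-parsed item with the product page description."""
        item = response.meta['item']             # attached in parse()
        meta = dict(response.meta['meta_item'])  # attached in get_meta()
        meta['product_page'] = response.css('.product_description').extract_first()
        item['meta'] = meta
        return item

With something along those lines, the item yielded from parse_get_meta should match the expected output shown in the question, with product_page filled in under meta.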