Scrapy returning only the first item
I'm learning Scrapy, and my spider only returns the first item on the page. Can someone tell me what I'm doing wrong?
Here is my code:
class RuvillaSpider(Spider):
    name = "RuvillaSpider"
    allowed_domains = ["ruvilla.com"]
    start_urls = ["https://www.ruvilla.com/men/footwear.html?dir=desc&limit=45&order=news_from_date"]

    def parse(self, response):
        products = Selector(response).xpath('//div[@class="category-products"]')
        if not products:
            raise CloseSpider('RuvillaSpider: DONE, NO MORE PAGES.')
        for product in products:
            item = RuvillaItem()
            item['name'] = product.xpath('ul/li/div/div[1]/a/@title').extract()[0]
            item['link'] = product.xpath('ul/li/div/div[1]/a/@href').extract()[0]
            item['image'] = product.xpath('ul/li/div/div[1]/a/img/@src').extract()[0]
            yield item
Your XPath seems to return only one product for the products variable.
Try:
$ scrapy shell "https://www.ruvilla.com/men/footwear.html?dir=desc&limit=45&order=news_from_date"
In[1]: response.xpath('//div[@class="category-products"]')
Out[1]: [<Selector xpath='//div[@class="category-products"]' data=u'<div class="category-products">\n<div cla'>]
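So your XPath isn't targeting each individual item; it targets the container that holds all of them. The loop therefore runs once, and extract()[0] picks the first match inside that container. To fix this, write an XPath that selects each product inside the container instead: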
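# Assumes `from scrapy import Request, Selector` at the top of the spider module.
def parse(self, response):
    # Select each product <li> inside the container, not the container itself.
    products = Selector(response).xpath('//div[@class="category-products"]//li[contains(@class,"item")]')
    for product in products:
        item = dict()
        # Relative XPaths (leading ".") are evaluated against the current product only.
        item['name'] = product.xpath('.//a/@title').extract_first()
        item['link'] = product.xpath('.//a/@href').extract_first()
        item['image'] = product.xpath('.//a/img/@src').extract_first()
        yield item
    # Follow the pagination link right after the current page, if there is one.
    next_page = response.xpath("//li[@class='current']/following-sibling::li[1]/a/@href").extract_first()
    if next_page:
        yield Request(next_page)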
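To see the difference without hitting the site, here is a small offline sketch. It uses Python's standard-library xml.etree.ElementTree rather than Scrapy's Selector, so only a subset of XPath is available. The markup in SAMPLE is made up to mimic the nesting used in the question and is not copied from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that mimics the container/item nesting from the question.
SAMPLE = """
<html><body>
<div class="category-products">
  <ul>
    <li class="item"><div><div><a href="/shoe-a" title="Shoe A"><img src="/a.jpg"/></a></div></div></li>
    <li class="item"><div><div><a href="/shoe-b" title="Shoe B"><img src="/b.jpg"/></a></div></div></li>
    <li class="item"><div><div><a href="/shoe-c" title="Shoe C"><img src="/c.jpg"/></a></div></div></li>
  </ul>
</div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Question's approach: select the container, then take the first match inside it.
containers = root.findall(".//div[@class='category-products']")
container_names = [
    c.findall("ul/li/div/div[1]/a")[0].get("title")  # [0] keeps only the first link
    for c in containers
]

# Answer's approach: select one node per product, then query relative to each node.
items = root.findall(".//div[@class='category-products']/ul/li")
item_names = [li.find(".//a").get("title") for li in items]

print(len(containers), container_names)
print(len(items), item_names)
```

The first loop runs once because there is a single container, so it can only ever yield one item. The second loop runs once per product. The same reasoning applies to Scrapy selectors: iterate over per-product nodes and use relative paths inside the loop.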
Your XPath is wrong.
Use this XPath instead:
('//div[@class="category-products"]/ul/li')