Scrapy (Python) - skipping section of site

Question

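I am working on a web crawler that collects data from a store website, https://www.promocje.zabka.pl. Sometimes a product does not have its price attached in the usual way (that is, in the span "product-price-integer"). The extracted arrays then differ in size, which causes trouble in the loop. I could limit the iteration to the length of the shortest array, but that may distort the results, because every entry after the missing price would be offset by one. So here is my question: how can I skip the whole product div if one of the fields (price integer, price decimal) is empty or does not exist?

Here is my code (note that the amount is usually not needed, because I can extract a more accurate value from the product name; that is the reason for the if):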

class MySpider(BaseSpider):
    name        = "zabka"
    allowed_domains = ["zabka.pl"]
    start_urls  = ["http://promocje.zabka.pl"]

    def parse(self, response):
        hxs     = HtmlXPathSelector(response)
        titles  = hxs.select('//div[@class="product-description"]/h2/text()').extract()
        prices  = hxs.select('//span[@class="product-price-integer"]/text()').extract()
        prices1 = hxs.select('//span[@class="product-price-decimal"]/text()').extract()
        amounts = hxs.select('//p[@class="product-unit"]/text()').extract()

        list = []

        for i in range(len(titles)):
            split = titles[i].split(",")
            if len(split) > 1:
                list1 = split[0] + "  -  " + prices[i] + "," + prices1[i] + "  -  " + split[len(split) - 1]
            else:
                list1 = titles[i] + "  -  " + prices[i] + "," + prices1[i] + "  -  " + amounts[i]
            list.append(list1)

        for l in list:
            item = NettutsItem()
            item["title"] = l
            yield item

Answer 1

I tried this:

scrapy shell http://promocje.zabka.pl/

The main idea is that you should use each product div as the starting point for the subsequent selectors.

In [32]: div  = response.xpath('//div[@class="product-description"]')[0]
In [33]: div.css('h2::text').extract()
Out[33]: [u'Babka marmur Dan Cake, 40 g']

In [34]: div.css('span[class="product-price-integer"]::text').extract()
Out[34]: [u'1']

Try this approach:

def parse(self, response):
    # Iterate over each product container so every lookup is scoped to one product.
    for div in response.xpath('//div[@class="product-description"]'):
        price = div.css('span[class="product-price-integer"]::text').extract()
        price1 = div.css('span[class="product-price-decimal"]::text').extract()
        title = div.css('h2::text').extract()[0].split(",")
        amount = div.css('p[class="product-unit"]::text').extract()[0]

        # Skip the whole product if either part of the price is missing.
        if price and price1:
            price = price[0]
            price1 = price1[0]
            item = NettutsItem()
            # do some stuff with item
            yield item

The code could be prettier.
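To see the skipping logic work outside a Scrapy project, here is a minimal sketch that applies the same per-product idea to a small, made-up HTML fragment. It uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, purely so that it runs on its own. The sample markup, the second and third product names, and the helper names text_of and parse_products are assumptions for illustration, not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that reuses the class names from the question.
# The second product has no price spans at all.
SAMPLE = """
<html><body>
  <div class="product-description">
    <h2>Babka marmur Dan Cake, 40 g</h2>
    <span class="product-price-integer">1</span>
    <span class="product-price-decimal">99</span>
    <p class="product-unit">szt.</p>
  </div>
  <div class="product-description">
    <h2>Product without a regular price</h2>
    <p class="product-unit">szt.</p>
  </div>
  <div class="product-description">
    <h2>Sok pomaranczowy, 1 l</h2>
    <span class="product-price-integer">3</span>
    <span class="product-price-decimal">49</span>
    <p class="product-unit">szt.</p>
  </div>
</body></html>
"""


def text_of(node, path):
    """Return the stripped text of the first match under node, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip() or None


def parse_products(markup):
    """Build one output row per product, skipping products with an incomplete price."""
    root = ET.fromstring(markup.strip())
    rows = []
    for div in root.iter("div"):
        if div.get("class") != "product-description":
            continue
        # Every lookup is relative to this div, so fields cannot drift between products.
        title = text_of(div, ".//h2")
        integer = text_of(div, ".//span[@class='product-price-integer']")
        decimal = text_of(div, ".//span[@class='product-price-decimal']")
        if not (title and integer and decimal):
            continue  # skip the whole product
        parts = title.split(",")
        if len(parts) > 1:
            name, amount = parts[0].strip(), parts[-1].strip()
        else:
            name = title
            amount = text_of(div, ".//p[@class='product-unit']") or ""
        rows.append("{}  -  {},{}  -  {}".format(name, integer, decimal, amount))
    return rows


if __name__ == "__main__":
    for row in parse_products(SAMPLE):
        print(row)
```

Because the price is looked up inside each div, a product with a missing price simply yields nothing, and the products after it keep their own prices. With page-wide lists, the third product's price would instead have been paired with the second product's title.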

python

web-crawler

scrapy