Why is scrapy not iterating over all the links on the page even though the xpaths are correct?

This code works fine when I use extract()[0] or extract(): it gives me the output for the first link parsed. I can't understand why it behaves this way, because this same code works perfectly when I crawl other websites with it.

For this website it scrapes only the first link. If I change it to extract()[1], it gives me the second link, and so on. Why doesn't it iterate automatically in the for loop?

import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()



class criticspider(BaseSpider):
    name = "mmt_mouth"
    allowed_domains = ["mouthshut.com"]
    start_urls = ["http://www.mouthshut.com/websites/makemytripcom-reviews-925031929"]
    # rules = (
        # Rule(
            # SgmlLinkExtractor(allow=("search=make-my-trip&page=1/+",)),
            # callback="parse",
            # follow=True),
    # )

    def parse(self, response):
        sites = response.xpath('//div[@id="allreviews"]')
        items = []

        for site in sites:
            item = CompItem()
            item['name'] = site.xpath('.//li[@class="profile"]/div/a/span/text()').extract()[0]
            item['title'] = site.xpath('.//div[@class="reviewtitle fl"]/strong/a/text()').extract()[0]
            item['date'] = site.xpath('.//div[@class="reviewrate"]//span[@class="datetime"]/span/span/span/text()').extract()[0]
            item['link'] = site.xpath('.//div[@class="reviewtitle fl"]/strong/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                yield scrapy.Request(item['link'],
                                    meta={'item': item},
                                    callback=self.anchor_page)

            items.append(item)

    def anchor_page(self, response):
        old_item = response.request.meta['item']

        old_item['data'] = response.xpath('.//div[@itemprop="description"]/p/text()').extract()
        yield old_item

Because your for loop has only one element to iterate over on the given site: the XPath matches the single wrapping div, so the loop body runs exactly once. Change your statement from

sites = response.xpath('//div[@id="allreviews"]')

to

sites = response.xpath('//div[@id="allreviews"]/ul/li')

Then your for loop will iterate over the individual list elements, one per review.
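To see the difference, here is a minimal sketch using Python's standard-library ElementTree instead of Scrapy's selectors, with made-up sample markup standing in for the real page (the sample HTML is an assumption, not the site's actual structure):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the page: one container div holding
# several review <li> elements (sample markup, not the real site).
SAMPLE_HTML = """
<html><body>
  <div id="allreviews">
    <ul>
      <li>review 1</li>
      <li>review 2</li>
      <li>review 3</li>
    </ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE_HTML)

# The original expression selects only the single container div,
# so "for site in sites" runs exactly once.
container = root.findall(".//div[@id='allreviews']")

# The corrected expression selects each <li>, one per review,
# so the loop runs once per review.
reviews = root.findall(".//div[@id='allreviews']/ul/li")

print(len(container))  # 1
print(len(reviews))    # 3
```

Inside the loop the relative expressions (starting with `.//`) then apply to each `<li>` instead of the whole container, which is why `extract()[0]` appeared to "work": with the container as the only iteration, index 0 always picked the first review on the page.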