Scrapy not giving individual results of all the reviews of a phone?

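This code produces results, but the output is not what I want. What is wrong with my XPath? And how do I iterate the rule by +10? I keep having trouble with both of these.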

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()



class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]

    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item

My output is:

 {'date': [u'\n 31 Mar 2015 ', u'\n 23 Mar 2015 '],
  'model_name': [u'\n Reviews of A & K 333 '],
  'name_reviewer': [u'\n pradeep kumar', u'\n vikas agrawal']}

I want my output to be:

{model_name :xyz
name_reviewer :abc
date:38383
}
{model_name :xyz
name_reviewer :hfhd
date:9283
}

I think the problem is with my XPath.

This should help. The problem is with your XPath:

In [1]: data_list = []

In [2]: sites = response.xpath('//div[@class="review-list"]/div')

In [3]: for site in sites:
    data = {}
    data['name_reviewer'] = site.xpath('./div/div[@class="line"]/span[@class="fk-color-title fk-font-11 review-username"]/text()|./div/div[@class="line"]/a[@class="load-user-widget fk-underline"]/text()').extract()[0].strip()
    data['date'] = site.xpath('./div/div[@class="date line fk-font-small"]/text()').extract()[0].strip()
    data['model_name'] =  response.xpath('//h1[@class="title"]/text()').extract()[0].strip()
    data_list.append(data)


In [4]: data_list
Out[4]: 
[{'date': u'10 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'RISHABH GROVER'},
 {'date': u'11 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Hemraj Chaudhari'},
 {'date': u'28 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'RISHABH GROVER'},
 {'date': u'27 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Debadutta Patnaik'},
 {'date': u'24 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Joel'},
 {'date': u'11 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Saswat Nayak'},
 {'date': u'14 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Amit Thakor'},
 {'date': u'28 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Nishchal Sharma'},
 {'date': u'13 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'siddiq hassan'},
 {'date': u'16 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Raja Shekhar'}]

First of all, your XPath expressions are generally very fragile.

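The main problem with your approach is that site does not contain a single review block, but it should. In other words, you are not iterating over the review blocks on the page, so each field ends up holding the values of every review at once.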
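To make the difference concrete, here is a minimal sketch of "one container" versus "one node per review". It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors so that it runs without a crawl; the markup in PAGE and the class name "name" are invented for illustration and are much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real Flipkart page).
PAGE = """
<html><body>
<div class="review-list">
  <div review-id="r1"><span class="name">abc</span><div class="date">31 Mar 2015</div></div>
  <div review-id="r2"><span class="name">hfhd</span><div class="date">23 Mar 2015</div></div>
</div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Too coarse: a single container node, so each field collects every review's value.
containers = root.findall(".//div[@class='review-list']")
coarse = [
    {"date": [d.text for d in c.findall(".//div[@class='date']")]}
    for c in containers
]

# One node per review: each relative query only sees that review's subtree.
blocks = root.findall(".//div[@review-id]")
per_review = [
    {
        "name_reviewer": b.find("./span[@class='name']").text,
        "date": b.find("./div[@class='date']").text,
    }
    for b in blocks
]

print(coarse)
print(per_review)
```

The first list has one entry with all dates lumped together, which is the shape of the output in the question; the second has one entry per review.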

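Also, the model name should be extracted outside the loop, since it is the same for every review on the page. I would also use .re() to extract the model name from the title, for example "Samsung Galaxy S5" out of "Reviews of Samsung Galaxy S5".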
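As a rough illustration of what that regular expression does, the sketch below applies the same pattern with the plain re module to title strings like the ones in the outputs above. The helper name extract_model_name is made up for this example; in the spider the pattern is applied through the selector's .re() method, which returns a list of matches, hence the [0] in the answer code.

```python
import re


def extract_model_name(title):
    """Return the text after 'Reviews of ', or None if the title does not match."""
    # Strip first so the surrounding newlines and spaces seen in the
    # scraped output do not end up in the captured group.
    match = re.search(r"Reviews of (.*?)$", title.strip())
    return match.group(1).strip() if match else None


print(extract_model_name(u"\n Reviews of Samsung Galaxy S5 "))
print(extract_model_name(u"\n Reviews of A & K 333 "))
```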

Here is the complete working code with the fixes applied:

def parse_start_url(self, response):
    sites = response.css('div.review-list div[review-id]')

    model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')[0].strip()
    for site in sites:
        item = CompItem()
        item['model_name'] = model_name
        item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract()).strip()
        item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()[0].strip()
        yield item

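The XPath expressions also became simpler. For example, the review blocks are identified by the CSS selector div.review-list div[review-id], which matches every div element that has a review-id attribute and sits anywhere under a div with the review-list class.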

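Also, note how name_reviewer is extracted. Since there are different kinds of users, some represented by a profile link and some unregistered and placed inside a span with the review-username class, I took a different approach: locate the review date and take the text of its first preceding sibling.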
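The "take whatever comes right before the date" idea can be sketched without Scrapy as well. ElementTree has no preceding-sibling axis, so the hypothetical helper reviewer_before_date below walks the children by hand; the two sample blocks are invented and only mimic the two kinds of users described above. Note that it matches "date" as a whole class token, which is slightly stricter than contains(@class, "date").

```python
import xml.etree.ElementTree as ET

# Hypothetical blocks: a registered user (profile link) and an unregistered one (span).
REGISTERED = """
<div review-id="r1"><div>
  <a class="load-user-widget fk-underline" href="/u/1">RISHABH GROVER</a>
  <div class="date line fk-font-small">10 Apr 2014</div>
</div></div>
"""
UNREGISTERED = """
<div review-id="r2"><div>
  <span class="fk-color-title fk-font-11 review-username">Joel</span>
  <div class="date line fk-font-small">24 May 2014</div>
</div></div>
"""


def reviewer_before_date(block):
    """Return the text of the element immediately preceding the date element."""
    for parent in block.iter():
        children = list(parent)
        for i, child in enumerate(children):
            is_date = "date" in child.get("class", "").split()
            if is_date and i > 0:
                # Join all text inside the previous sibling, whatever its tag is.
                return "".join(children[i - 1].itertext()).strip()
    return None


for markup in (REGISTERED, UNREGISTERED):
    print(reviewer_before_date(ET.fromstring(markup)))
```

The same function handles both kinds of users because it never looks at the tag or class of the name element.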


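I want to point out that class names such as line, fk-font-small and fk-font-11 are layout-oriented classes and, in general, relying on them in XPath expressions and CSS selectors is not a good choice. Note the classes used to locate elements in this answer: review-list, title, date. They are more data-oriented and are a better choice for your locators.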
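Here is a small sketch of why that matters, again using ElementTree instead of Scrapy so it can be run on its own. The "restyled" markup in NEW is an assumption about how a layout change might look, not something observed on the site.

```python
import xml.etree.ElementTree as ET

OLD = '<div><div class="date line fk-font-small">10 Apr 2014</div></div>'
# Hypothetical restyle: only a layout-oriented class changed.
NEW = '<div><div class="date line fk-font-11">10 Apr 2014</div></div>'


def by_exact_class(root):
    """Locator tied to the full class string, layout classes included."""
    node = root.find(".//div[@class='date line fk-font-small']")
    return node.text if node is not None else None


def by_data_class(root):
    """Locator that only requires the data-oriented 'date' class token."""
    for node in root.iter("div"):
        if "date" in node.get("class", "").split():
            return node.text
    return None


for markup in (OLD, NEW):
    root = ET.fromstring(markup)
    print(by_exact_class(root), by_data_class(root))
```

The exact-class locator stops matching as soon as the font class changes, while the one keyed on date keeps working.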