Scrapy creating XML feed wraps content in "value" tags

My code works well; I got some help with it here. The only problem is that when it generates the XML, it wraps the content in "value" tags, which I don't want. According to the documentation, this happens because:

Unless overridden in the :meth:serialize_field method, multi-valued fields are exported by serializing each value inside a <value> element. This is for convenience, as multi-valued fields are very common.

Here is my output:

<?xml version="1.0" encoding="UTF-8"?>
<items>
   <item>
      <body>
         <value>Don't forget me this weekend!</value>
      </body>
      <to>
         <value>Tove</value>
      </to>
      <who>
         <value>Jani</value>
      </who>
      <heading>
         <value>Reminder</value>
      </heading>
   </item>
</items>

This is what I seem to be sending to the XML exporter, so I don't understand why it thinks the fields are multi-valued:

{'body': [u"Don't forget me this weekend!"],
 'heading': [u'Reminder'],
 'to': [u'Tove'],
 'who': [u'Jani']}

pipeline.py

from scrapy import signals
from scrapy.contrib.exporter import XmlItemExporter

class XmlExportPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
         pipeline = cls()
         crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
         crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
         return pipeline

    def spider_opened(self, spider):
        file = open('%s_products.xml' % spider.name, 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

spider.py

from scrapy.contrib.spiders import XMLFeedSpider
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['www.w3schools.com']
    start_urls = ['http://www.w3schools.com/xml/note.xml']
    itertag = 'note'

    def parse_node(self, response, selector):
        item = CrawlerItem()
        item['to'] = selector.xpath('//to/text()').extract()
        item['who'] = selector.xpath('//from/text()').extract()
        item['heading'] = selector.xpath('//heading/text()').extract()
        item['body'] = selector.xpath('//body/text()').extract()
        return item

Any help would be greatly appreciated. I just want the same output without the redundant tags.

The extract() method will always return a list of values, even if the result is only a single value, e.g. [4], [3, 4, 5], or an empty list. To avoid this, if you know there will be only one value, you can select it like this:

item['to'] = selector.xpath('//to/text()').extract()[0]

Note: be aware that this can throw an exception when extract() returns an empty list and you try to index it. In such uncertain cases, a good trick is:

item['to'] = (selector.xpath('...').extract() or [''])[0]
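The trick works because extract() returns a plain Python list, so the `or` fallback only kicks in when the list is empty. A minimal sketch with plain lists standing in for extract() results (no Scrapy required):

```python
# Plain lists stand in for the result of selector.xpath(...).extract().
hit = [u'Tove']   # the XPath matched one node
miss = []         # the XPath matched nothing

# An empty list is falsy, so `or ['']` supplies a one-element fallback
# before indexing; [0] is then always safe.
to = (hit or [''])[0]
fallback = (miss or [''])[0]
```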

Or you can write a custom function to get the first element:

def extract_first(selector, default=None):
    val = selector.extract()
    return val[0] if val else default

That way you get a default value in case the value you are looking for is not found:

item['to'] = extract_first(selector.xpath(...))  # First or None
item['to'] = extract_first(selector.xpath(...), 'not-found')  # First or 'not-found'
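To try the helper without spinning up a spider, here is a self-contained sketch; FakeSelector is an invented stand-in that mimics only the .extract() method of the object returned by selector.xpath(...):

```python
def extract_first(selector, default=None):
    # Same helper as above: first element of the extracted list, or a default.
    val = selector.extract()
    return val[0] if val else default

class FakeSelector(object):
    """Illustration-only stand-in for a Scrapy selector;
    only .extract() is mimicked."""
    def __init__(self, values):
        self.values = values

    def extract(self):
        return list(self.values)

to = extract_first(FakeSelector([u'Tove']))              # first value
missing = extract_first(FakeSelector([]), 'not-found')   # default used
```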

The answer above is correct about why this happens, but I would add that there is now out-of-the-box support for this, so there is no need to write a helper method:

item['to'] = selector.xpath('//to/text()').extract_first()

item['to'] = selector.xpath('//to/text()').extract_first(default='spam')