Activating a Pipeline Component in Scrapy to write JSON

I am trying to save the scraped items in separate JSON files, but I don't see any output files. The pipeline and the item are defined in the pipelines.py and items.py files in the Scrapy project folder. Do I have to call process_item() explicitly, or is it called automatically when I return the item in scrape()? I enabled the pipeline in CrawlerProcess(settings={'ITEM_PIPELINES'}). Thanks.
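
For reference, the run script presumably looks something like the sketch below; the spider import path and the exact ITEM_PIPELINES key are guesses, since the actual script isn't shown. Note that the dotted path has to be importable from wherever the script is executed:

# Hypothetical run script (not the asker's actual code).
from scrapy.crawler import CrawlerProcess

from myproject.spiders.my_spider import mySpider  # assumed import path

process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {'myproject.pipelines.JsonWriterPipeline': 100},
})
process.crawl(mySpider)
process.start()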

Pipeline

import json
import datetime

import scrapy
from scrapy.spiders import CrawlSpider

class JsonWriterPipeline(object):
    def process_item(self, item, spider):
        # Write each item to its own timestamped JSON file (second granularity).
        fileName = datetime.datetime.now().strftime("%Y%m%d%H%M%S") + '.json'
        try:
            with open(fileName, 'w') as fp:
                json.dump(dict(item), fp)
                return item
        except Exception:
            # Pass the item through even if writing fails.
            return item

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()


class mySpider(CrawlSpider):
    name = 'mySPider'
    allowed_domains = ['allowedDOmain.org']
    start_urls = ['https://url.org']

    def parse(self, response):
        monthSelector = '//div[@class="archives-column"]/ul/li/a[contains(text(),"November 2019")]/@href'
        monthLink = response.xpath(monthSelector).extract_first()
        yield response.follow(monthLink, callback=self.scrape)

    def scrape(self, response):
        # get the links to all individual articles
        linkSelector = '.entry-title a::attr(href)'
        allLinks = response.css(linkSelector).extract()

        for link in allLinks:
            item = ProjectItem()
            item['url'] = link
            request = response.follow(link, callback=self.getContent)
            request.meta['item'] = item
            item = request.meta['item']
            yield item

        nextPageSelector = 'span.page-link a::attr(href)'
        nextPageLink = response.css(nextPageSelector).extract_first()
        yield response.follow(nextPageLink, callback=self.scrape)

    def getContent(self, response):
        item = response.meta['item']
        TITLE_SELECTOR = '.entry-title ::text'
        item['title'] = response.css(TITLE_SELECTOR).extract_first()
        yield item

In settings.py, add:

ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 100,
}

where myproject is the name of your project/folder.

See the last heading on this page: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

When the spider is run from a script, you need to import the settings using the approach described here: Running scrapy from script not including pipeline
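
A minimal sketch of that approach, assuming a standard Scrapy project layout (the spider's import path is a placeholder):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_spider import mySpider  # assumed import path

# get_project_settings() loads settings.py, including ITEM_PIPELINES,
# so JsonWriterPipeline is active when the spider runs from a script.
process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start()

With the pipeline activated this way, process_item() is called automatically for every item the spider yields; there is no need to call it explicitly.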