What is the correct 'scrapy' way to capture the output of multiple spiders into a single output file / variable?

Question

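I'm new to Scrapy and am trying to understand how I can output data post-scrape. I've read some of the documentation, but I'm struggling to follow it, and the available docs are a little vague for my use case.

Basically, I launch around nine spiders at the same time to scrape certain information from a given website. I launch the crawl as follows: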

from scrapy.utils import project
from scrapy.crawler import CrawlerProcess

# Initialise a crawler
crawl_process = CrawlerProcess(project.get_project_settings())

# Locate and iterate through the spiders
for spider in (x for x in crawl_process.spider_loader.list()):
    crawl_process.crawl(spider)

# Kick it off
crawl_process.start()

The output of these spiders is one dictionary per spider (if there is a match), similar to the following:

yield {
    'Retailer': 'Amazon',
    'Item': 'Product Example',
    'Price': '£62.50',
    'URL': 'URL'
}

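I'd like to avoid writing to disk as much as possible, because this script will run from the SD card on my Raspberry Pi, so writing dozens of files once every couple of minutes isn't ideal.

My requirement:

I'd like to find a way to collect the data in memory until all of the spiders have finished. I then need to be able to build a list of dictionaries for comparison, and finally output it to a single JSON file. Is this possible? I'm really scratching my head.

If it isn't possible, I'd really like to know how to handle this properly.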

Answer 1

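Using a Scrapy pipeline together with MongoDB Cloud is a good option. Use the pipeline for any logic and write whatever you need to MongoDB.

Alternatively, you can just add everything to MongoDB and write all of the logic with PyMongo and a few simple scripts.

The pipeline for option 2 would look something like this: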

import os

import pymongo
from itemadapter import ItemAdapter


class MongoPipeline:

    collection_name = 'scrapy_items'

    def open_spider(self, spider):
        # The connection string is read from an environment variable
        self.client = pymongo.MongoClient(os.environ.get("MONGO_DB_CONNECTION_STRING"))
        self.db = self.client["db name"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each scraped item as a plain dict
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item

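The lowest tier of MongoDB Cloud is free and can probably handle your data volume for the whole lifetime of the project (considering you're using a Raspberry Pi).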
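
A pipeline only runs once it has been enabled in the project settings. A minimal sketch of that step, assuming the `MongoPipeline` class above lives in a hypothetical module `myproject/pipelines.py` (the module path and the priority value 300 are placeholders, not something from this answer):

```python
# settings.py (fragment)
# "myproject.pipelines" is a hypothetical module path; use your own project's.
# The integer controls ordering between pipelines (lower runs first);
# 300 is an arbitrary choice here.
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,
}
```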

Answer 2

You can accumulate your items in a class variable and write them to disk at the end:

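import json
from itemadapter import ItemAdapter


class CustomPipeline:

    # Class variable, so it is shared by every spider's pipeline instance
    data = []

    def close_spider(self, spider):
        # Write self.data to disk here; "items.json" is a placeholder path
        with open("items.json", "w") as f:
            json.dump(self.data, f)

    def process_item(self, item, spider):
        self.data.append(ItemAdapter(item).asdict())
        return item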
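
Because `close_spider` is called once for each spider, one way to end up with a single write per run is to let the pipeline only collect, and to dump the list after `crawl_process.start()` returns, since that call blocks until every crawl has finished. A minimal sketch, assuming the items are plain dicts as in the question; `CollectingPipeline`, `dump_collected` and the output path are hypothetical names:

```python
import json


class CollectingPipeline:
    """Keeps every scraped item in memory; never touches the disk itself."""

    # Class attribute: shared by the pipeline instances of all spiders
    # running in the same process.
    items = []

    def process_item(self, item, spider):
        # Items are assumed to be plain dicts, as in the question.
        self.items.append(dict(item))
        return item


def dump_collected(path):
    """Write everything collected so far to one JSON file; return the count."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(CollectingPipeline.items, fh, ensure_ascii=False, indent=2)
    return len(CollectingPipeline.items)
```

With the pipeline enabled in `ITEM_PIPELINES`, calling `dump_collected("results.json")` on the line after `crawl_process.start()` would give one write for all spiders together.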

Answer 3

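For anyone else who might run into this problem, I found another answer in a Stack Overflow comment that listens for the item_scraped signal, which lets you interact with each item as needed.

So my process is to set each spider crawling, append each spider's results to a list, and return that list to my process queue once the spiders have finished.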

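from pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils import project

results = []

# Called every time the item_scraped signal is received
def crawler_results(signal, sender, item, response, spider):
    results.append(item)

# Connect the handler to the item_scraped signal
dispatcher.connect(crawler_results, signal=signals.item_scraped)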

# Initialise a crawler
crawl_process = CrawlerProcess(project.get_project_settings())

# Locate and iterate through the spiders
for spider in (x for x in crawl_process.spider_loader.list()):
    crawl_process.crawl(spider)

# Kick it off
crawl_process.start()

# Output results to the queue (`queue` is created elsewhere in my code)
queue.put(results)
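
The question also asks to compare the collected dictionaries and write a single JSON file. A rough sketch of that last step, assuming the item keys from the question and that "compare" means keeping the cheapest offer per item; that rule, `parse_price`, `cheapest_per_item`, `write_json` and the output path are all illustrative assumptions:

```python
import json
import re
from decimal import Decimal


def parse_price(text):
    """Turn a price string such as '£62.50' into a Decimal (drops symbols and commas)."""
    return Decimal(re.sub(r"[^\d.]", "", text))


def cheapest_per_item(results):
    """Return one dict per 'Item' value: the entry with the lowest 'Price'."""
    best = {}
    for entry in results:
        current = best.get(entry["Item"])
        if current is None or parse_price(entry["Price"]) < parse_price(current["Price"]):
            best[entry["Item"]] = entry
    return list(best.values())


def write_json(results, path):
    """Write the list of dicts to a single JSON file in one go."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, ensure_ascii=False, indent=2)
```

Something like `write_json(cheapest_per_item(results), "results.json")` after the crawl would then touch the SD card once per run.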
