Passing scraped data in pipelines __init__ scrapy for python

Question

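I'm trying to pass an item that contains the header data on to my pipeline. Is there a way to handle this inside parse, given that the data gets reset for the next page? I've tried super(mySpider,self).__init__(*args,*kwargs), but the data isn't sent correctly. I need the title of the web page as the file name, which is why I need that specific item in there.

Something like this: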

   def __init__(self, item):

      self.csvwriter = csv.writer(open(item['title'][0]+'.csv', 'wb'), delimiter=',')
      self.csvwriter.writerow(['Name','Date','Location','Stars','Subject','Comment','Response','Title'])

Answer 1

An ItemPipeline doesn't work the way you think it does.

If you look at the docs, you will see:

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

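This means your header travels with a single item and reaches the pipeline only with that one item. By default the order of items is not guaranteed, so you cannot expect a particular item to arrive at the pipeline first in order to set the header.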

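One alternative is to mark this particular item and look for it in your pipeline. If it hasn't arrived yet, store the incoming items until it does, then write the header followed by the stored items. From then on you can write each item to the CSV file as it arrives. Another option is to write the items only once the spider has finished crawling.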
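The buffering approach described above could be sketched roughly as follows. This is only a sketch under assumptions: the `is_header` flag, the `title` field, the column list, and the class name `BufferUntilHeaderPipeline` are all hypothetical, items are assumed to convert cleanly with `dict(item)`, and the title is assumed to be a plain string that is safe to use as a file name. It relies only on the `process_item` and `close_spider` hooks that Scrapy calls on a pipeline component.

```python
import csv


class BufferUntilHeaderPipeline:
    """Hold items back until the marked header item arrives (hypothetical sketch)."""

    # Column names are an assumption; adjust them to the fields your items carry.
    columns = ['Name', 'Date', 'Location', 'Stars', 'Subject', 'Comment', 'Response']

    def __init__(self):
        self.pending = []   # items that arrived before the header item
        self.file = None
        self.writer = None

    def process_item(self, item, spider):
        row = dict(item)
        if row.get('is_header'):
            # The marked item carries the page title that names the CSV file.
            self.file = open(row['title'] + '.csv', 'w', newline='')
            self.writer = csv.writer(self.file, delimiter=',')
            self.writer.writerow(self.columns)
            # Flush everything that was waiting for the header.
            for buffered in self.pending:
                self._write(buffered)
            self.pending = []
        elif self.writer is None:
            # Header not seen yet: keep the item in memory.
            self.pending.append(row)
        else:
            self._write(row)
        return item

    def _write(self, row):
        self.writer.writerow([row.get(col, '') for col in self.columns])

    def close_spider(self, spider):
        # Called once when the spider finishes; release the file handle.
        if self.file is not None:
            self.file.close()
```

If the header item never arrives, the buffered items are never written in this sketch; a real pipeline would have to decide what to do with them in `close_spider`.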

However, I wonder why the headers you export are not fixed for the Spider you are using... but I suppose that can still happen.

Answer 2

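The input to any pipeline is your item. In your case, you need to pass the name (or any other data) inside the item. Then you should write a pipeline that writes that item to the file system (or a database, or whatever else you want to do with it).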
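One way to carry the name inside every item is to copy the page title onto each scraped row before yielding it. The helper below is a hypothetical sketch, not part of the original answer: the function name `attach_title` and the `title` field are assumptions, and rows are assumed to be plain dicts.

```python
def attach_title(title, rows):
    """Yield a copy of each row with the page title added (hypothetical helper)."""
    for row in rows:
        item = dict(row)        # copy, so the caller's row is left untouched
        item['title'] = title   # every item now knows which file it belongs to
        yield item
```

Inside a spider callback, something like `yield from attach_title(page_title, rows)` would then hand each item, with its title, to the pipeline.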

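Sample code

Suppose your new pipeline is named 'NewPipeline' and is located in the root directory of your scrapy project.

In your settings, you need to define the pipeline as: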

ITEM_PIPELINES = {
    'YourRootDirectory.NewPipeline.NewPipeline': 800,
    # add any other pipelines you have
}

Your pipeline should look something like this:

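import json

class NewPipeline(object):
    def process_item(self, item, spider):
        name = item['name']
        # Open in text mode, because json.dumps returns a str; 'with' closes the file.
        with open("pathToWhereYouWantToSave" + name, 'w') as f:
            # Serialize the item as a single line of JSON and write it to the file.
            f.write(json.dumps(dict(item)))
        # Return the item so that any later pipelines still receive it.
        return item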
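For the CSV case in the question, a pipeline along the following lines could open one file per page title the first time that title is seen, and reuse it for later items. This is a sketch under assumptions rather than a definitive implementation: the `title` field, the column list, and the class name `CsvPerTitlePipeline` are illustrative, and the title is assumed to be a plain string that is safe to use as a file name.

```python
import csv


class CsvPerTitlePipeline:
    """Write each item to a CSV file named after its title (hypothetical sketch)."""

    # Column names are an assumption; adjust them to the fields your items carry.
    columns = ['Name', 'Date', 'Location', 'Stars', 'Subject', 'Comment', 'Response']

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.files = {}
        self.writers = {}

    def process_item(self, item, spider):
        row = dict(item)
        title = row['title']
        if title not in self.writers:
            # First item for this title: create the file and write the header row.
            f = open(title + '.csv', 'w', newline='')
            writer = csv.writer(f, delimiter=',')
            writer.writerow(self.columns)
            self.files[title] = f
            self.writers[title] = writer
        self.writers[title].writerow([row.get(col, '') for col in self.columns])
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; close every file that was opened.
        for f in self.files.values():
            f.close()
```

Because the file name comes from the item itself, nothing needs to be passed to the pipeline's `__init__`, and the order in which items arrive does not matter.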

Notes

You can place the pipeline in any other module.
