Activating a Pipeline Component in Scrapy to write JSON
I am trying to save the scraped items to separate JSON files, but I don't see any output files. The pipeline and the item are defined in the pipelines.py and items.py files in the Scrapy project folder. Do I have to call process_item() explicitly, or is it called automatically when I return an item in scrape()? I enabled the pipeline via CrawlerProcess(settings={'ITEM_PIPELINES'}). Thanks.
Pipeline
import json, datetime

class JsonWriterPipeline(object):
    def process_item(self, item, spider):
        # return item
        fileName = datetime.datetime.now().strftime("%Y%m%d%H%M%S") + '.json'
        try:
            with open(fileName, 'w') as fp:
                json.dump(dict(item), fp)
            return item
        except:
            # bare except: any write error is silently swallowed here
            return item
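To answer the question directly: process_item() is never called by your own code in a real run; the engine invokes it for every item a callback yields, once the pipeline is enabled in the settings. It can still be sanity-checked outside of Scrapy by calling it by hand with a plain dict (a standalone sketch of the class above; the lastFile attribute is added here only so the check can find the file):

```python
import datetime
import json

class JsonWriterPipeline(object):
    def process_item(self, item, spider):
        # One timestamped file per item; note that two items scraped within
        # the same second silently overwrite each other's file.
        fileName = datetime.datetime.now().strftime("%Y%m%d%H%M%S") + '.json'
        with open(fileName, 'w') as fp:
            json.dump(dict(item), fp)
        self.lastFile = fileName  # recorded only for this standalone check
        return item

pipeline = JsonWriterPipeline()
item = {'title': 'demo', 'url': 'https://url.org'}
returned = pipeline.process_item(item, spider=None)
print(returned == item, pipeline.lastFile.endswith('.json'))  # True True
```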
import scrapy

class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
from scrapy.spiders import CrawlSpider

class mySpider(CrawlSpider):
    name = 'mySPider'
    allowed_domains = ['allowedDOmain.org']
    start_urls = ['https://url.org']

    def parse(self, response):
        monthSelector = '//div[@class="archives-column"]/ul/li/a[contains(text(),"November 2019")]/@href'
        monthLink = response.xpath(monthSelector).extract_first()
        yield response.follow(monthLink, callback=self.scrape)

    def scrape(self, response):
        # get the links to all individual articles
        linkSelector = '.entry-title a::attr(href)'
        allLinks = response.css(linkSelector).extract()
        for link in allLinks:
            # item = articleItem()
            item = ProjectItem()
            item['url'] = link
            request = response.follow(link, callback=self.getContent)
            request.meta['item'] = item
            item = request.meta['item']
            yield item
        nextPageSelector = 'span.page-link a::attr(href)'
        nextPageLink = response.css(nextPageSelector).extract_first()
        yield response.follow(nextPageLink, callback=self.scrape)

    def getContent(self, response):
        item = response.meta['item']
        TITLE_SELECTOR = '.entry-title ::text'
        item['title'] = response.css(TITLE_SELECTOR).extract_first()
        yield item
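A side note on scrape(): the item is yielded right after the request is built, before getContent has run, so 'title' is still empty when the pipeline receives it, and the request carrying the meta is itself never yielded. The intended meta hand-off can be sketched without Scrapy, with plain functions standing in for the two callbacks (names and the tuple shape here are illustrative, not Scrapy API):

```python
# Simulated request/meta hand-off between two callbacks: the first callback
# attaches the half-built item to the request's meta and does NOT yield it;
# only the second callback yields the completed item.
def scrape(links):
    for link in links:
        item = {'url': link}              # started here...
        meta = {'item': item}
        yield ('request', link, meta)     # hand off via meta, no item yet

def get_content(meta, title):
    item = meta['item']
    item['title'] = title                 # ...completed here
    yield item

# Drive the two callbacks by hand, as the engine would.
results = []
for _, link, meta in scrape(['https://url.org/a1']):
    results.extend(get_content(meta, 'Article 1'))
print(results)  # [{'url': 'https://url.org/a1', 'title': 'Article 1'}]
```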
In settings.py, add:
ITEM_PIPELINES = {
'myproject.pipelines.JsonWriterPipeline':100
}
where myproject is the name of your project/folder.
See the last heading on this page: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
When running the spider from a script, the settings need to be imported using the approach described here: Running scrapy from script not including pipeline
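One detail worth double-checking in the question itself: CrawlerProcess(settings={'ITEM_PIPELINES'}) passes a Python set containing the bare string 'ITEM_PIPELINES', not a mapping of setting names to values, so no pipeline gets registered that way. The difference is visible without Scrapy at all (myproject is a placeholder for your project name):

```python
# A set literal: one string element, no values -> Scrapy sees no pipeline config.
broken = {'ITEM_PIPELINES'}

# A dict: setting name -> value, where the value maps the pipeline's
# dotted path to its order.
working = {
    'ITEM_PIPELINES': {
        'myproject.pipelines.JsonWriterPipeline': 100,
    },
}

print(type(broken).__name__, type(working).__name__)  # set dict
```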