Order a JSON by field using Scrapy
I've created a spider to scrape problems from projecteuler.net, and I have concluded my answer to a related question with the following:
I launch this with the command scrapy crawl euler -o euler.json, and it outputs an array of unordered JSON objects, each one corresponding to a single problem. This is fine for me, because I'm going to process it with JavaScript, even though I suspect that solving the ordering problem within Scrapy could be very simple.
Unfortunately, ordering the items that Scrapy writes to the JSON (I need them in ascending order by the id field) does not seem to be so straightforward. I have studied every component (middlewares, pipelines, exporters, signals, and so on), but none of them seems useful for this purpose. I have come to the conclusion that a solution to this problem simply does not exist in Scrapy (except perhaps for a very elaborate trick), and that you are forced to order things in a second phase. Do you agree, or do you have any ideas? I copy my crawler code here.
The spider:
# -*- coding: utf-8 -*-
import scrapy
from eulerscraper.items import Problem
from scrapy.loader import ItemLoader


class EulerSpider(scrapy.Spider):
    name = 'euler'
    allowed_domains = ['projecteuler.net']
    start_urls = ["https://projecteuler.net/archives"]

    def parse(self, response):
        numpag = response.css("div.pagination a[href]::text").extract()
        maxpag = int(numpag[len(numpag) - 1])

        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

        for i in range(2, maxpag + 1):
            next_page = "https://projecteuler.net/archives;page=" + str(i)
            yield response.follow(next_page, self.parse_next)

        return [scrapy.Request("https://projecteuler.net/archives", self.parse)]

    def parse_next(self, response):
        for href in response.css("table#problems_table a::attr(href)").extract():
            next_page = "https://projecteuler.net/" + href
            yield response.follow(next_page, self.parse_problems)

    def parse_problems(self, response):
        l = ItemLoader(item=Problem(), response=response)
        l.add_css("title", "h2")
        l.add_css("id", "#problem_info")
        l.add_css("content", ".problem_content")

        yield l.load_item()
The item:
import re

import scrapy
from scrapy.loader.processors import MapCompose, Compose
from w3lib.html import remove_tags


def extract_first_number(text):
    i = re.search(r'\d+', text)
    return int(text[i.start():i.end()])


def array_to_value(element):
    return element[0]


class Problem(scrapy.Item):
    id = scrapy.Field(
        input_processor=MapCompose(remove_tags, extract_first_number),
        output_processor=Compose(array_to_value)
    )
    title = scrapy.Field(input_processor=MapCompose(remove_tags))
    content = scrapy.Field()
If I needed my output file to be sorted (and I will assume you have a valid reason to want that), I would probably write a custom exporter.
This is how Scrapy's built-in JsonItemExporter is implemented.
With a few simple changes, you can modify it to append items to a list in export_item(), and then sort the list and write out the file in finish_exporting().
Since you are only scraping a few hundred items, the downsides of storing them in a list and not writing to the file until the crawl is finished should not be a problem for you.
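A minimal sketch of what such an exporter could look like, assuming the id field is always present on every item; the class name, module paths and the FEED_EXPORTERS entry below are illustrative, not part of the original project:

# exporters.py -- buffer items, sort them, then reuse the parent's JSON writing
from scrapy.exporters import JsonItemExporter


class SortedJsonItemExporter(JsonItemExporter):

    def start_exporting(self):
        # collect items instead of writing them as they arrive
        self._buffered_items = []
        super().start_exporting()

    def export_item(self, item):
        self._buffered_items.append(item)

    def finish_exporting(self):
        # sort by the numeric id field, then let JsonItemExporter
        # handle serialization and comma placement
        for item in sorted(self._buffered_items, key=lambda i: i['id']):
            super().export_item(item)
        super().finish_exporting()


# settings.py -- register the exporter for the json feed format
FEED_EXPORTERS = {
    'json': 'eulerscraper.exporters.SortedJsonItemExporter',
}

With this in place, the same scrapy crawl euler -o euler.json command should produce a file sorted by id.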
So far I have found a working solution using a pipeline:
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.list_items = []
        self.file = open('euler.json', 'w')

    def close_spider(self, spider):
        ordered_list = [None for i in range(len(self.list_items))]

        for i in self.list_items:
            ordered_list[int(i['id']) - 1] = json.dumps(dict(i))

        self.file.write("[\n")
        self.file.write(",\n".join(ordered_list))
        self.file.write("\n]\n")
        self.file.close()

    def process_item(self, item, spider):
        self.list_items.append(item)
        return item
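For this pipeline to run, it also has to be enabled in the project settings; assuming it lives in the default pipelines.py module of the eulerscraper project (the module path and the priority value are assumptions), the entry would look something like:

# settings.py -- enable the pipeline
ITEM_PIPELINES = {
    'eulerscraper.pipelines.JsonWriterPipeline': 300,
}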
This works, although it is probably not the optimal approach, since the documentation warns in another example:
The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.
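If neither the exporter nor the pipeline route is attractive, the "second phase" mentioned in the question can also be a small post-processing step run after the crawl; a minimal sketch, assuming the feed file produced by the command above:

# sort_output.py -- sort the feed produced by `scrapy crawl euler -o euler.json`
import json

with open('euler.json') as f:
    problems = json.load(f)

# ascending order by the numeric id field
problems.sort(key=lambda p: p['id'])

with open('euler.json', 'w') as f:
    json.dump(problems, f, indent=2)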