Python Scrapy web-crawling and scraping

Question

I'm writing a Scrapy spider to crawl a travel website. The site is structured as follows:

Continents
    North America
        USA
            lat: 123
            long: 456
        Canada
            lat: 123
            long: 456
    South America
        Brazil
            lat: 456
            long: 789
        Peru
            lat: 123
            long: 456

I've figured out how to crawl each country page and extract the lat/long information using the script below, but I'm having trouble storing the information.

import scrapy


class WorldSpider(scrapy.Spider):
    name = "world"

    def start_requests(self):
        urls = [
            'http://www.world.com'  # Scrapy needs a URL scheme here
        ]
        for url in urls:
            # parse() handles the top-level page that lists the continents
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for link in response.css(CONTINENT_SELECTOR):
            continent = link.css('a::attr(href)').extract_first()
            if continent is not None:
                continent = response.urljoin(continent)
                yield response.follow(continent, callback=self.parse_continent)

    def parse_continent(self, continent_response):
        country_urls = continent_response.css(COUNTRY_SELECTOR)
        if len(country_urls) == 0:
            # This if-statement is entered when the Spider is at a country web page (e.g. USA, Canada, etc.).
            # TODO figure out how to store this to text file or append to JSON object
            yield {
                'country': continent_response.css(TITLE_SELECTOR).extract_first(),
                'latitude' : continent_response.css(LATITUDE_SELECTOR).extract_first(),
                'longitude' : continent_response.css(LONGITUDE_SELECTOR).extract_first()
            }

        for link in country_urls:
            country = link.css('a::attr(href)').extract_first()
            if country is not None:
                # response.follow() resolves relative URLs itself
                yield continent_response.follow(country, callback=self.parse_continent)

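How can I write this information to a file or a JSON object? Ideally, I'd like a data structure that captures the structure of the website.

Example: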

{
    "continents": [
        {"North America" : [
            {"country" : {"title": "USA", "latitude" : 123, "longitude" : 456}},
            {"country" : {"title": "Canada", "latitude" : 123, "longitude" : 456}}
        ]},
        {"South America" : [
            {"country" : {"title": "Brazil", "latitude" : 456, "longitude" : 789}},
            {"country" : {"title": "Peru", "latitude" : 123, "longitude" : 456}}
        ]}
    ]
}

How should I modify my spider to achieve the above?

Answer 1

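Scrapy provides this out of the box through Feed Exports, which let you generate a feed of the scraped items using multiple serialization formats and storage backends.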

scrapy crawl world -o name.json

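This saves the parsed items to name.json. Note that the crawl command takes the spider's name attribute ("world" in the question), not the class name.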
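The same export can also be configured in the project settings rather than on the command line. The following is a minimal sketch, assuming a Scrapy release that supports the `FEEDS` setting (older releases used `FEED_URI` and `FEED_FORMAT` instead); the file name and option values are only illustrative.

```python
# settings.py (sketch): configure a JSON feed export.
# Assumes a Scrapy version that supports the FEEDS setting.
FEEDS = {
    "name.json": {
        "format": "json",    # serialization format of the feed
        "encoding": "utf8",  # write non-ASCII characters instead of \uXXXX escapes
        "indent": 4,         # pretty-print the exported items
    },
}
```

With this in place, running the spider without `-o` should still produce the feed file.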

Answer 2

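There are two ways to store the data in a file. The first, as @Jan mentioned, is to use a JsonWritePipeline; this approach is recommended when the spider is run multiple times and appends to the same file on each run.

Here is an example of that kind of write: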

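# Inside a spider callback: append the raw response body to a file.
# response.body is bytes, so the file is opened in binary append mode.
with open(filename, 'ab') as f:
    f.write(response.body)
self.log('Saved file %s' % filename)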
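For scraped items rather than raw response bodies, an item pipeline can do the appending. Below is a minimal sketch of such a pipeline. The class name `JsonLinesWriterPipeline` and the output path `items.jl` are hypothetical, and it writes JSON lines (one item per line) so that repeated runs keep the file parseable.

```python
import json


class JsonLinesWriterPipeline:
    """Hypothetical item pipeline: appends each item as one JSON line."""

    filename = "items.jl"  # assumed output path, not from the original post

    def open_spider(self, spider):
        # Append mode, so repeated runs add to the same file.
        self.file = open(self.filename, "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for flat scrapy.Item objects.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

A pipeline like this is enabled through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.JsonLinesWriterPipeline": 300}`, where `myproject` stands in for your own project package.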

The simplest approach, however, is the Feed Export option, which is easier to set up.

Feed Exports allow you to generate a feed with the scraped items, using multiple serialization formats and storage backends. For serializing the scraped data, the feed exports use the Item exporters. These formats are supported out of the box:
    JSON
    JSON lines
    CSV
    XML

Here is an example of using Feed Export to store the data as a JSON file:

$ scrapy crawl myExample -o output.json

Note: Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second time, you’ll end up with a broken JSON file.
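To see why appending breaks a JSON file, here is a small standard-library sketch (no Scrapy involved, sample records made up for illustration). It mimics two runs that each write a complete JSON array to the same file, then contrasts that with JSON lines, where every line is its own document.

```python
import json

# Each run exports its items as one complete JSON array.
run1 = json.dumps([{"country": "USA"}])
run2 = json.dumps([{"country": "Canada"}])

# Two arrays written back to back are not one valid JSON document.
appended_json = run1 + run2
try:
    json.loads(appended_json)
    json_is_valid = True
except json.JSONDecodeError:
    json_is_valid = False

# JSON lines: each line parses on its own, so appending stays readable.
appended_jl = '{"country": "USA"}\n' + '{"country": "Canada"}\n'
records = [json.loads(line) for line in appended_jl.splitlines()]

print(json_is_valid)  # False
print(len(records))   # 2
```

If the spider is meant to run repeatedly against the same output file, exporting to JSON lines (for example `-o output.jl`) avoids the problem. Newer Scrapy releases also provide `-O`, which overwrites the file instead of appending.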

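As for the structure of the data in the JSON, I prefer to use Item, because it gives you a very clear structure to work with, and for deeply nested JSON it makes the structure easier to validate.

For your implementation, the structure should be declared as: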

import scrapy

class Address(scrapy.Item):
    title = scrapy.Field()
    latitude = scrapy.Field()
    longitude = scrapy.Field()

class Place(scrapy.Item):
    country = scrapy.Field()  # an Address item

class Continents(scrapy.Item):
    name = scrapy.Field()  # list of Place items

I'll let you figure out how to implement it ;-)
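As a starting point, here is one possible sketch of the grouping step in plain Python. It assumes each scraped record also carries a `continent` key, which the spider in the question does not yield yet; one way to get it there would be to pass the continent name from `parse` to `parse_continent` through the `cb_kwargs` argument of `response.follow`. The function name `nest_by_continent` and the sample records are hypothetical.

```python
import json


def nest_by_continent(items):
    """Group flat country records under their continent name."""
    grouped = {}
    for item in items:
        country = {
            "title": item["title"],
            "latitude": item["latitude"],
            "longitude": item["longitude"],
        }
        # Keep continents in first-seen order; collect countries per continent.
        grouped.setdefault(item["continent"], []).append({"country": country})
    return {"continents": [{name: countries} for name, countries in grouped.items()]}


# Sample flat records, shaped like the values from the question.
flat_items = [
    {"continent": "North America", "title": "USA", "latitude": 123, "longitude": 456},
    {"continent": "North America", "title": "Canada", "latitude": 123, "longitude": 456},
    {"continent": "South America", "title": "Brazil", "latitude": 456, "longitude": 789},
    {"continent": "South America", "title": "Peru", "latitude": 123, "longitude": 456},
]

nested = nest_by_continent(flat_items)
print(json.dumps(nested, indent=4))
```

The grouping could run in a pipeline's `close_spider` method once all items have been collected, or as a post-processing step over an exported JSON lines file.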
