How to customize URI in Scrapy with non built in storage URI parameters

Question

I want to customize the Scrapy feed URI for S3 so that it includes the dimensions of the uploaded file. Currently I have the following in my settings.py file:

FEEDS = {
    's3://path-to-file/file_to_have_dimensions.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        'indent': 4,
    }
}

But I would like something like the following:

NUMBER_OF_ROWS_IN_CSV = file.height()
FEEDS = {
    f's3://path-to-files/file_to_have_dimensions_{NUMBER_OF_ROWS_IN_CSV}.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
        'indent': 4,
    }
}

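Note that I want the number of rows to be inserted automatically.

Can this be done by changing settings.py alone, or does it require changes to other parts of the Scrapy code?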

Answer 1

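The feed file is created when the spider starts running, and at that point the number of items is not yet known. However, when the spider finishes running, it calls a method named closed. From that method you can access the spider's stats and settings, and you can perform any other task you want to run after the spider has finished scraping and saving items.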
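For context, Scrapy does let you add your own feed URI parameters through the FEED_URI_PARAMS setting, but as far as I understand that function runs when the feed is opened, so it cannot know the final item count. Below is a minimal sketch, assuming a recent Scrapy version in which the function returns the new dict. The `crawl_date` parameter and the `myproject` package name are hypothetical, used only for illustration.

```python
# settings.py (sketch): a custom, non built-in feed URI parameter.
from datetime import datetime, timezone


def uri_params(params, spider):
    # Runs before any item has been scraped, so an item count
    # is not available here; only values known at start-up are.
    crawl_date = datetime.now(timezone.utc).strftime("%Y%m%d")
    # Return a new dict instead of mutating the one passed in.
    return {**params, "crawl_date": crawl_date}


FEEDS = {
    # %(time)s is built in; %(crawl_date)s comes from uri_params above.
    "s3://path-to-file/file_%(crawl_date)s_%(time)s.csv": {
        "format": "csv",
        "encoding": "utf8",
    },
}

# Hypothetical import path; point it at wherever uri_params actually lives.
FEED_URI_PARAMS = "myproject.settings.uri_params"
```

This covers values that are known up front. For a value that only exists after the crawl, such as the row count, something has to run after the spider closes.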

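In the example below, I rename the feed file from initial_file.csv to final_file_{item_count}.csv.

Because you cannot rename a file in S3, I use the boto3 library to copy initial_file.csv to a new file whose name includes the item_count value, and then delete the initial file.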

import scrapy
import boto3

class SampleSpider(scrapy.Spider):
    name = 'sample'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    custom_settings = {
        'FEEDS': {
            's3://my_bucket/path-to-file/initial_file.csv': {
                'format': 'csv',
                'encoding': 'utf8',
                'store_empty': False,
                'indent': 4,
            }
        }
    }

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

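    def closed(self, reason):
        # Scrapy calls this once the spider has finished; the stat is missing
        # when nothing was scraped, so fall back to 0.
        item_count = self.crawler.stats.get_value('item_scraped_count', 0)
        try:
            # Placeholder credentials: prefer environment variables or an IAM role.
            session = boto3.Session(aws_access_key_id='awsAccessKey', aws_secret_access_key='awsSecretAccessKey')
            s3 = session.resource('s3')
            # S3 has no rename operation: copy to the new key, then delete the original.
            s3.Object('my_bucket', f'path-to-file/final_file_{item_count}.csv').copy_from(CopySource='my_bucket/path-to-file/initial_file.csv')
            s3.Object('my_bucket', 'path-to-file/initial_file.csv').delete()
        except Exception:
            # Log the traceback instead of silently swallowing the error.
            self.logger.exception("unable to rename s3 file")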
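
If you want to keep the spider itself small, the copy-then-delete step can be pulled out into helpers. This is only a sketch under a few assumptions: the function names `build_final_key` and `rename_s3_object` are my own, and `s3` is expected to be a boto3 S3 service resource such as the one returned by `session.resource('s3')` above. Passing the resource in as an argument makes the helpers easy to exercise without touching AWS.

```python
import posixpath


def build_final_key(initial_key, item_count, stem="final_file"):
    """Return a key in the same S3 'directory' with the item count in its name."""
    directory, filename = posixpath.split(initial_key)
    _, extension = posixpath.splitext(filename)
    return posixpath.join(directory, f"{stem}_{item_count}{extension}")


def rename_s3_object(s3, bucket, old_key, new_key):
    """Emulate a rename in S3: copy old_key to new_key, then delete old_key."""
    s3.Object(bucket, new_key).copy_from(
        CopySource={"Bucket": bucket, "Key": old_key}
    )
    # Only reached if the copy did not raise, so the original is never
    # deleted without a copy having been made first.
    s3.Object(bucket, old_key).delete()
```

Inside closed you would then call something like `rename_s3_object(s3, 'my_bucket', old_key, build_final_key(old_key, item_count))`, with `old_key = 'path-to-file/initial_file.csv'`.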


scrapy