Scrapy saves an array of images in the JSON file too, instead of only the URL

Question

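After learning how to download images properly with Scrapy, I am now trying to produce a clean JSON file that contains only the image URL. However, Scrapy also saves an empty `images` array that I don't know how to get rid of.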

def parse(self, response):
    # Collect the src attribute of every <img> element on the page.
    raw_image_urls = response.xpath(".//img/@src").getall()
    for img_url in raw_image_urls:
        # Resolve relative URLs against the page URL and yield one item per image.
        yield {
            'image_url': response.urljoin(img_url),
        }

This produces:

{"image_url": "https://image.shutterstock.com/image-photo/deep-forest-river-wild-waterfall-260nw-1585363855.jpg", "images": []},

instead of just:

{"image_url": "https://image.shutterstock.com/image-photo/deep-forest-river-wild-waterfall-260nw-1585363855.jpg"},

which is what I need.

I modified the pipeline like this:

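from scrapy.pipelines.images import ImagesPipeline

class customImagePipeline(ImagesPipeline):
    # The keyword-only item argument is passed by newer Scrapy versions.
    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last segment of the URL as the stored file name.
        return request.url.split('/')[-1]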

which should give the images the correct file names.
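As a quick sanity check of that file-naming expression, here is a minimal standalone sketch that needs no Scrapy. The helper names `name_from_split` and `name_from_path` and the `example.com` URL are hypothetical, and the `urlparse` variant is only a suggested alternative, not part of the pipeline in the question.

```python
import posixpath
from urllib.parse import urlparse


def name_from_split(url):
    # Same expression as in file_path above: the last "/"-separated segment.
    return url.split('/')[-1]


def name_from_path(url):
    # Alternative (an assumption, not from the question): look only at the
    # URL path, so a query string does not end up in the file name.
    return posixpath.basename(urlparse(url).path)


url = ("https://image.shutterstock.com/image-photo/"
       "deep-forest-river-wild-waterfall-260nw-1585363855.jpg")
print(name_from_split(url))

# Hypothetical URL with a query string, to show where the two differ.
hypothetical = "https://example.com/img/photo.jpg?size=large"
print(name_from_split(hypothetical))  # photo.jpg?size=large
print(name_from_path(hypothetical))   # photo.jpg
```

For URLs like the one in the question, both helpers give the same name; they only differ when the URL carries a query string or fragment.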

Answer 1

`ImagesPipeline.item_completed` is what sets that field, so you need to override it in your pipeline subclass so that it returns the item untouched:

    def item_completed(self, results, item, info):
        return item
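To illustrate why the override works, here is a small sketch that runs without Scrapy. `StandInImagesPipeline` is a hypothetical, heavily simplified stand-in written for this example. It is not Scrapy's actual code and only mimics the one behaviour that matters here, namely that the base `item_completed` writes the successful download results into the `images` field. `UrlOnlyPipeline` and the `example.com` URL are also made up.

```python
class StandInImagesPipeline:
    """Simplified stand-in for ImagesPipeline (assumption, not the real class)."""

    images_result_field = "images"

    def item_completed(self, results, item, info):
        # Keep the metadata of every successful download under the result field.
        # With no downloads at all, this still adds an empty list to the item.
        item[self.images_result_field] = [data for ok, data in results if ok]
        return item


class UrlOnlyPipeline(StandInImagesPipeline):
    def item_completed(self, results, item, info):
        # Same override as in the answer: hand the item back unchanged.
        return item


item = {"image_url": "https://example.com/a.jpg"}

# An empty results list models the case where nothing was downloaded.
default_output = StandInImagesPipeline().item_completed([], dict(item), None)
custom_output = UrlOnlyPipeline().item_completed([], dict(item), None)

print(default_output)  # {'image_url': 'https://example.com/a.jpg', 'images': []}
print(custom_output)   # {'image_url': 'https://example.com/a.jpg'}
```

In a real project the override would go into the `ImagesPipeline` subclass from the question, next to `file_path`.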

python

web-crawler

scrapy