Allow duplicate downloads with Scrapy Image Pipeline?

Question

Please see my example code below, which uses the Scrapy image pipeline to download/scrape images from a site:

import scrapy
from scrapy_splash import SplashRequest
from imageExtract.items import ImageextractItem

class ExtractSpider(scrapy.Spider):
    name = 'extract'
    start_urls = ['url']

    def parse(self, response):
        image = ImageextractItem()
        titles = ['a', 'b', 'c', 'd', 'e', 'f']
        rel = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6']

        image['title'] = titles
        image['image_urls'] = rel
        return image

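Everything works fine, but by default the pipeline avoids downloading duplicates. Is there a way to override this so that I can download the duplicates as well? Thanks.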

Answer 1

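I think one possible solution is to create your own image pipeline that inherits from scrapy.pipelines.images.ImagesPipeline and overrides the get_media_requests method (see the documentation for an example). When yielding each scrapy.Request, pass dont_filter=True to the constructor.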

Answer 2

Thanks to Tomáš's guidance, I finally found a way to download duplicate images.

In the _process_request method of class MediaPipeline, I commented out these lines.

# Return cached result if request was already seen
# if fp in info.downloaded:
#     return defer_result(info.downloaded[fp]).addCallbacks(cb, eb)

# Check if request is downloading right now to avoid doing it twice
# if fp in info.downloading:
#     return wad

This causes an uncaught KeyError, but it did not seem to affect my results, so I stopped digging any further.
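A plausible reason for that KeyError, sketched in plain Python rather than with Scrapy's actual code: with both checks commented out, two downloads of the same fingerprint run at once, and when the second one finishes it tries to remove a fingerprint from the downloading set that the first one already removed. The Info class and the helper functions below are hypothetical stand-ins; only the attribute names downloading and waiting mirror the snippets in this thread.

```python
from collections import defaultdict


class Info:
    """Hypothetical stand-in for the pipeline's per-spider state."""

    def __init__(self):
        self.downloading = set()
        self.waiting = defaultdict(list)


def start(info, fp, waiter):
    # Every request registers a waiter and marks fp as downloading,
    # even if the same fp is already in flight.
    info.waiting[fp].append(waiter)
    info.downloading.add(fp)


def finish_unguarded(info, fp):
    # Unconditional bookkeeping: fails if fp was already removed.
    info.downloading.remove(fp)
    return info.waiting.pop(fp)


def finish_guarded(info, fp):
    # Membership check in the spirit of the fix in Answer 3.
    if fp in info.waiting:
        info.downloading.remove(fp)
        return info.waiting.pop(fp)
    return []


def run_twice(finish):
    info = Info()
    start(info, "fp1", "waiter-1")
    start(info, "fp1", "waiter-2")
    first = finish(info, "fp1")
    try:
        second = finish(info, "fp1")
    except KeyError:
        second = "KeyError"
    return first, second


print(run_twice(finish_unguarded))
print(run_twice(finish_guarded))
```

In this sketch the first completion hands the result to both waiters, so the second completion has nothing left to do. That would be consistent with the error not affecting the downloaded files.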

Answer 3

To get around the KeyError that Rick mentioned, this is what I did:

Also in class MediaPipeline, find the function _cache_result_and_execute_waiters. You will see an if block like the following:

if isinstance(result, Failure):
    # minimize cached information for failure
    result.cleanFailure()
    result.frames = []
    result.stack = None

I added another if case that checks whether fp is in info.waiting, and everything after that point goes inside this case:

if fp in info.waiting:
    info.downloading.remove(fp)  
    info.downloaded[fp] = result  # cache result
    for wad in info.waiting.pop(fp):
        defer_result(result).chainDeferred(wad)

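In the debug log, the path names in the "images" field of your Scrapy item are still incorrect. However, I got the files saved under the correct paths by creating a list of image names for all the "image_urls".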


python

pipeline

scrapy