Scrapy FilesPipeline: download again even if file_urls is the same

I am new to Scrapy.

I am using FilesPipeline to download some .pdf files.

I have noticed that if the value of file_urls on a scrapy.Item stays the same, the download process is not started again.

What I want is to re-download the files anyway.

How can I solve this?

Thanks.

Override the _onsuccess function by adding it to your pipeline (copy media_to_download, which contains it, from FilesPipeline).

It looks like this:

    def media_to_download(self, request, info, *, item=None):
        def _onsuccess(result):
            if not result:
                return  # returning None forces the download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None forces the download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None forces the download

            referer = referer_str(request)
            logger.debug(
                'File (uptodate): Downloaded %(medianame)s from %(request)s '
                'referred in <%(referer)s>',
                {'medianame': self.MEDIA_NAME, 'request': request,
                 'referer': referer},
                extra={'spider': info.spider}
            )
            # ... more code here that I didn't copy ...
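
For context, self.expires here is populated from the FILES_EXPIRES setting (90 days by default), so out of the box a file is only downloaded again once the stored copy is older than that. Setting it very low in settings.py is a code-free way to shrink that window, while the override shown below makes the age check irrelevant altogether:

    # settings.py
    FILES_EXPIRES = 90  # default; files younger than this (in days) count as up to date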

Just add a `return` to skip the "uptodate" part and proceed with the download instead:

    def media_to_download(self, request, info, *, item=None):
        def _onsuccess(result):
            if not result:
                return  # returning None forces the download

            last_modified = result.get('last_modified', None)
            if not last_modified:
                return  # returning None forces the download

            age_seconds = time.time() - last_modified
            age_days = age_seconds / 60 / 60 / 24
            if age_days > self.expires:
                return  # returning None forces the download

            return  # force the download even when the stored file is up to date

(You could also modify the function inside FilesPipeline itself, but I don't recommend doing that.)

Also, remember to activate your custom pipeline and add any other functionality you need.
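
For example, in settings.py (the module path here is an assumption; use your own project's package name):

    # settings.py -- 'myproject' is a placeholder for your project's package
    ITEM_PIPELINES = {
        'myproject.pipelines.ProcessPipeline': 1,
    }
    FILES_STORE = '/path/to/downloaded/files'  # FilesPipeline requires a storage location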

Your pipelines.py file should now look like this:

    from itemadapter import ItemAdapter
    from scrapy.pipelines.files import FilesPipeline
    from scrapy.http import Request
    import logging
    import time
    from twisted.internet import defer
    from scrapy.utils.log import failure_to_exc_info

    logger = logging.getLogger(__name__)


    class TempPipeline:

        def process_item(self, item, spider):
            return item


    class ProcessPipeline(FilesPipeline):

        def get_media_requests(self, item, info):
            # schedule one download request per URL in the item's file_urls field
            urls = ItemAdapter(item).get(self.files_urls_field, [])
            return [Request(u) for u in urls]

        def media_to_download(self, request, info, *, item=None):
            def _onsuccess(result):
                if not result:
                    return  # returning None forces the download

                last_modified = result.get('last_modified', None)
                if not last_modified:
                    return  # returning None forces the download

                age_seconds = time.time() - last_modified
                age_days = age_seconds / 60 / 60 / 24
                if age_days > self.expires:
                    return  # returning None forces the download

                return  # force the download even when the stored file is up to date

            path = self.file_path(request, info=info, item=item)
            dfd = defer.maybeDeferred(self.store.stat_file, path, info)
            dfd.addCallbacks(_onsuccess, lambda _: None)
            dfd.addErrback(
                lambda f:
                logger.error(self.__class__.__name__ + '.store.stat_file',
                             exc_info=failure_to_exc_info(f),
                             extra={'spider': info.spider})
            )
            return dfd
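
For completeness, here is a minimal item definition that works with this pipeline, using the default field names FilesPipeline expects (the item class name is just an example):

    # items.py -- PdfItem is a hypothetical name
    import scrapy

    class PdfItem(scrapy.Item):
        file_urls = scrapy.Field()  # input: list of URLs for FilesPipeline to fetch
        files = scrapy.Field()      # output: filled in by FilesPipeline after download

Yielding PdfItem(file_urls=['https://example.com/some.pdf']) from your spider should now download the file on every run, even when a copy already exists in FILES_STORE.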