Scrapy handle 302 response code

Question

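I am using a simple CrawlSpider implementation to crawl websites. By default, Scrapy follows 302 redirects to the target location and more or less ignores the originally requested link. On one particular site I came across a page that 302-redirects to another page. My goal is to log both the original link (the one that responds with 302) and the target location (specified in the HTTP response header), and to process both in the parse_item method of the CrawlSpider. How can I achieve this?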

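I have come across solutions that suggest using dont_redirect=True or REDIRECT_ENABLED=False, but I do not actually want to ignore the redirects. In fact, I want to consider (i.e. not ignore) the redirecting page as well.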

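For example: I visit http://www.example.com/page1, which sends a 302 redirect HTTP response and redirects to http://www.example.com/page2. By default, Scrapy ignores page1, follows the redirect to page2 and processes it. I want to process both page1 and page2 in parse_item.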

EDIT: I am already using handle_httpstatus_list = [500, 404] in the spider's class definition to handle 500 and 404 response codes in parse_item, but the same does not work for 302 if I specify it in handle_httpstatus_list.

Answer 1

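The redirect middleware will "catch" the response before it reaches your httperror middleware and launch a new request with the redirect URL. At the same time, the original response is not returned, i.e. you never even "see" the 302 codes, because they do not reach the httperror middleware. So putting 302 in handle_httpstatus_list has no effect.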

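Have a look at the source of scrapy.downloadermiddlewares.redirect.RedirectMiddleware: in process_response() you can see what is happening. It launches a new request and replaces the original URL with the redirected_url. There is no "return response", so the original response is simply dropped.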

Basically, you just need to override the process_response() function, adding a line with "return response" in addition to sending another request with the redirected_url.

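In parse_item you probably want some conditional statements, depending on whether the response is a redirect or not. I assume the two will not look exactly the same, so your items may look quite different as well. Another option is to use a different parser for each kind of response (depending on whether the original or the redirected URLs are "special pages"). In that case, all you need is a separate parse function in your spider, such as parse_redirected_urls(), which you call via the callback of the redirect request.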
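The "was this a redirect?" check in parse_item can lean on bookkeeping Scrapy already does: when the built-in RedirectMiddleware follows a redirect, it records the URLs the request passed through under the redirect_urls key of the request meta. Below is a minimal sketch, assuming that key is present. The helper names redirect_chain and was_redirected are hypothetical, and the functions only touch plain dicts, so they run without Scrapy:

```python
def redirect_chain(meta, final_url):
    """Return every URL visited for one request, oldest first.

    `meta` is expected to look like Scrapy's request meta: when the
    built-in RedirectMiddleware followed redirects, the key
    "redirect_urls" holds the URLs that answered with a redirect.
    """
    visited = list(meta.get("redirect_urls", []))
    visited.append(final_url)
    return visited


def was_redirected(meta):
    """True if at least one redirect was followed for this request."""
    return bool(meta.get("redirect_urls"))


# Plain-dict stand-ins for response.meta and response.url.
meta = {"redirect_urls": ["http://www.example.com/page1"]}
chain = redirect_chain(meta, "http://www.example.com/page2")
print(chain)
```

Inside a callback this would be called as redirect_chain(response.meta, response.url). The first element is the originally requested link and the last is the page that was actually parsed. Note that this recovers only the URL of page1, not its response body.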

Answer 2

Scrapy 1.0.5 (the latest official release as I write these lines) does not use handle_httpstatus_list in the built-in RedirectMiddleware; see this issue. As of Scrapy 1.1.0 (1.1.0rc1 is available), the issue is fixed.

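Even with redirects disabled, you can still mimic the middleware's behavior in your callback by checking the Location header and returning a Request to the redirect target.

Example spider: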

$ cat redirecttest.py
import scrapy


class RedirectTest(scrapy.Spider):

    name = "redirecttest"
    start_urls = [
        'http://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip'
    ]
    handle_httpstatus_list = [302]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

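    def parse_page(self, response):
        self.logger.debug("(parse_page) response: status=%d, URL=%s" % (response.status, response.url))
        # With redirects disabled, the 302 response itself reaches this callback.
        if response.status in (302,) and 'Location' in response.headers:
            # Header values are bytes on Python 3; decode before joining with the URL.
            location = response.headers['Location'].decode('latin-1')
            self.logger.debug("(parse_page) Location header: %r" % location)
            # Follow the redirect manually so the target page is parsed as well.
            yield scrapy.Request(
                response.urljoin(location),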
                callback=self.parse_page)
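
The Location handling in parse_page can be pulled out into a small pure function, which makes it easy to log the original link together with its target. This is a sketch under assumptions: describe_redirect is a hypothetical helper, headers is modelled as a plain dict rather than Scrapy's Headers object, and only the standard library's urljoin is used:

```python
from urllib.parse import urljoin


def describe_redirect(status, url, headers):
    """Return a record of a redirect, or None if `status` is not 302.

    The Location value may be relative, so it is resolved against the
    URL that produced the response, much like response.urljoin() in
    the spider above.
    """
    location = headers.get("Location")
    if status != 302 or location is None:
        return None
    if isinstance(location, bytes):
        # Header values arrive as bytes on Python 3.
        location = location.decode("latin-1")
    return {"original": url, "status": status, "target": urljoin(url, location)}


record = describe_redirect(
    302,
    "https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip",
    {"Location": b"http://httpbin.org/ip"},
)
print(record)
```

In the spider, a record like this could be logged or yielded as an item for the 302 response before the follow-up Request is yielded, so that both the original link and the target end up in the output.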

Console log:

$ scrapy runspider redirecttest.py -s REDIRECT_ENABLED=0
[scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {'REDIRECT_ENABLED': '0'}
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines: 
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/get
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=302, URL=https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip
[redirecttest] DEBUG: (parse_page) Location header: 'http://httpbin.org/ip'
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/ip
[scrapy] INFO: Closing spider (finished)

Note that you need handle_httpstatus_list with 302 in it; otherwise, you will see this kind of log (coming from HttpErrorMiddleware):

[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[scrapy] DEBUG: Ignoring response <302 https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip>: HTTP status code is not handled or not allowed
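
Why the 302 is dropped without handle_httpstatus_list can be illustrated with a simplified model of the check that HttpErrorMiddleware applies before a response reaches the callback. This is not Scrapy's actual code: reaches_callback is a hypothetical name, and settings such as HTTPERROR_ALLOWED_CODES are left out of the model:

```python
def reaches_callback(status, handle_httpstatus_list=()):
    """Simplified model of the filter: 2xx always passes, others only if listed."""
    if 200 <= status < 300:
        return True
    return status in handle_httpstatus_list


# A spider without the attribute versus the example spider above.
without_attribute = reaches_callback(302)
with_attribute = reaches_callback(302, [302])
print(without_attribute, with_attribute)
```

The first call corresponds to the "Ignoring response" log line, the second to the run in which parse_page receives the 302.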
