How to get the response body in a Scrapy downloader middleware
I need to be able to retry a request if certain XPaths are not found on the page. So I wrote this middleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware


class ManualRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if not spider.retry_if_not_found:
            return response
        if not hasattr(response, 'text') and response.status != 200:
            return super(ManualRetryMiddleware, self).process_response(request, response, spider)
        found = False
        for xpath in spider.retry_if_not_found:
            if response.xpath(xpath).extract():
                found = True
                break
        if not found:
            return self._retry(request, "Didn't find anything useful", spider)
        return response
and registered it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ManualRetryMiddleware': 650,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
When I run the spider, I get:
AttributeError: 'Response' object has no attribute 'xpath'
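I tried to create a selector manually and run the XPath on it, but the response has no text attribute, and response.body is bytes, not str.
So how can I inspect the page content in the middleware? Some pages may not contain the details I need, so I would like to retry them later.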
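The reason response has no xpath method is that the response argument of a downloader middleware's process_response method is of type scrapy.http.Response, see the documentation. Only scrapy.http.TextResponse (and its subclasses such as scrapy.http.HtmlResponse) have an xpath method. So before using xpath, create an HtmlResponse object from response. The corresponding part of your class would become: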
...
new_response = scrapy.http.HtmlResponse(response.url, body=response.body)
if new_response.xpath(xpath).extract():
    found = True
    break
...
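Also pay attention to your middleware's position. It needs to come before scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware in the middleware order, that is, it needs an order value lower than that middleware's default of 590. Scrapy calls process_response in decreasing order, so with a lower value your middleware runs after the body has been decompressed. Otherwise you may end up trying to decode compressed data, which does not work. Check response.headers to see whether the response is compressed: Content-Encoding: gzip.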
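As a sketch of the ordering point: Scrapy calls process_response in decreasing order of the numbers in DOWNLOADER_MIDDLEWARES, and HttpCompressionMiddleware sits at 590 by default, so any value below 590 lets the custom middleware see the decompressed body. The value 550 used here is only one possible choice (it is the default slot of the disabled RetryMiddleware); the module path is the one from the question.

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Below 590, so process_response runs after HttpCompressionMiddleware
    # has already decompressed the body.
    "myproject.middlewares.ManualRetryMiddleware": 550,
    # Disable the stock retry middleware that the custom class replaces.
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
}
```

With the 650 from the question, the middleware would instead run before decompression and could receive gzip bytes.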