Scrapy: handle 301/302 response codes and follow the target URL

Question

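I am using Scrapy 1.0.5 to implement a crawler. I have set REDIRECT_ENABLED = False and handle_httpstatus_list = [500, 301, 302] so that I can scrape pages that return 301 and 302 responses. However, because REDIRECT_ENABLED is False, the spider never goes on to the target URL given in the Location response header. How can I make it do both?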

Answer 1

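It has been a long time since I did something like this, but you need to generate a Request object with url, meta and callback arguments.

As far as I remember, you can do it along these lines: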

from scrapy import Request

def parse(self, response):
    # do whatever you need to do .... then
    if response.status in (301, 302) and 'Location' in response.headers:
        # header values are bytes on Python 3, so decode before joining
        location = response.headers['Location'].decode('latin-1')
        # response.urljoin resolves a relative Location against response.url
        # and returns an absolute Location unchanged
        newurl = response.urljoin(location)
        yield Request(url=newurl, meta=response.meta, callback=self.parse_whatever)
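The "absolute or relative URL" check in the snippet above can be tried outside Scrapy. The following is a minimal sketch using only the standard library; the helper name resolve_location and the example.com URLs are made up for illustration and are not part of Scrapy.

```python
from urllib.parse import urljoin


def resolve_location(base_url, location):
    """Return the absolute redirect target for a Location header value.

    Hypothetical helper for illustration only; not a Scrapy API.
    """
    # Header values may arrive as bytes; decode them before joining.
    if isinstance(location, bytes):
        location = location.decode("latin-1")
    # urljoin returns absolute URLs unchanged and resolves relative
    # ones against base_url, so no separate absolute/relative test is needed.
    return urljoin(base_url, location.strip())


print(resolve_location("http://example.com/a/b", b"/login"))
# http://example.com/login
print(resolve_location("http://example.com/a/b", "c"))
# http://example.com/a/c
print(resolve_location("http://example.com/a/b", b"https://other.example/x"))
# https://other.example/x
```

Because a single urljoin call covers both cases, the two alternative assignments to newurl are not needed.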

scrapy

web-scraping

scrapy-spider