Why does this scrapy proxymiddleware make duplicated requests?
I want to use a proxy middleware to add a proxy to my spider, but I don't understand why the request ends up being filtered as a duplicate.
The code is as follows:
# spider.py
from scrapy import Request
from scrapy.spiders import CrawlSpider
from TaylorSpider.items import TaylorspiderItem

class TaylorSpider(CrawlSpider):
    name = 'Taylor'
    allowed_domains = ['tandfonline.com']
    start_urls = ['http://www.tandfonline.com/action/cookieAbsent']

    def start_requests(self):
        yield Request(self.start_urls[0], dont_filter=True, callback=self.parse_start_url)

    def parse_start_url(self, response):
        item = TaylorspiderItem()
        item['PageUrl'] = response.url
        yield item
# middlewares.py
import logging

logger = logging.getLogger(__name__)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        logger.info('pr........................')
        request.meta['proxy'] = 'http://58.16.86.239:8080'
        return request
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'TaylorSpider.middlewares.ProxyMiddleware': 100,
}
When dont_filter=True, it gets stuck in an infinite loop, and the log is:
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:56:21 [TaylorSpider.middlewares] INFO: pr........................
But when dont_filter=False, the log is:
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider opened
2017-07-19 13:54:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-19 13:54:25 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-07-19 13:54:25 [TaylorSpider.middlewares] INFO: pr........................
2017-07-19 13:54:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.tandfonline.com/action/cookieAbsent> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-19 13:54:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'dupefilter/filtered': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 422000),
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'log_count/WARNING': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 7, 19, 5, 54, 25, 414000)}
2017-07-19 13:54:25 [scrapy.core.engine] INFO: Spider closed (finished)
So how can I fix this?
A downloader middleware's process_request should return None if it only patches the request and wants the framework to keep processing it:
process_request() should either: return None, return a Response
object, return a Request object, or raise IgnoreRequest.
If it returns None, Scrapy will continue processing this request,
executing all other middlewares until, finally, the appropriate
downloader handler is called the request performed (and its response
downloaded).
(...)
If it returns a Request object, Scrapy will stop calling
process_request methods and reschedule the returned request. Once the
newly returned request is performed, the appropriate middleware chain
will be called on the downloaded response.
So you want to remove the return request at the end of process_request. Because your middleware returns the request, Scrapy reschedules it instead of downloading it, the middleware runs again on the rescheduled copy, and so on; that is the infinite loop you see with dont_filter=True, while with dont_filter=False the dupe filter drops the rescheduled copy and the spider closes immediately.
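For reference, here is a minimal sketch of the corrected middleware, keeping the logic from the question but dropping the trailing return so the method implicitly returns None and Scrapy continues down the middleware chain (the proxy address is the one from the question; the log message is just illustrative):

# middlewares.py
import logging

logger = logging.getLogger(__name__)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Attach the proxy and fall through: returning None (implicitly)
        # lets the remaining downloader middlewares and the download
        # handler process the request instead of rescheduling it.
        logger.info('setting proxy for %s', request.url)
        request.meta['proxy'] = 'http://58.16.86.239:8080'

With this change the request is downloaded once, parse_start_url runs, and neither the infinite loop (dont_filter=True) nor the dupe-filter drop (dont_filter=False) occurs.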