Scrapy start_requests() didn't yield all requests
def start_requests(self):
    db = SeedUserGenerator()
    result = db.selectSeedUsers()
    db.closeDB()
    urls = []
    for name in result:
        urls.append(self.user_info_url.format(name))
    for url in urls:
        yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
    print('fin')

def parse_user(self, response):
    # ......... ignore some code here ...........
    yield Request(url=next_url, priority=20, callback=self.parse_info)

def parse_info(self, response):
    # ......... ignore some code here ...........
    yield Request(url=next_url, priority=30, callback=self.parse_user)
The program runs as follows:
- A few requests are yielded from start_requests, and then start_requests seems to pause without ever printing the string 'fin'.
- A response comes back and parse_user yields another Request, but the remaining Requests in start_requests are only yielded after that response has been processed, so the yield operations seem to go round in a ring.

It looks synchronous: no other request can be yielded until a request sent from start_requests has had its response processed?

Does this mean Scrapy will never yield the remaining requests in start_requests? How can I get Scrapy to finish running start_requests?

I'm new to Python and still getting the hang of Scrapy. Can Scrapy process responses and yield requests at the same time?

By the way, I'm using Python 3.6 with Scrapy 1.5.1 and Twisted 20.3.0.
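What is happening with 'fin' is ordinary Python generator behaviour rather than anything specific to Request handling: start_requests() is a generator, and a generator only runs up to its next yield each time the consumer asks for another item. A minimal plain-Python sketch of that behaviour (no Scrapy involved):

def start_requests():
    for i in range(3):
        print('about to yield', i)
        yield i
    print('fin')               # reached only once the generator is exhausted

gen = start_requests()
print(next(gen))               # runs the body up to the first yield and stops
print(next(gen))               # resumes and stops at the second yield
for item in gen:               # 'fin' appears only after this drains the rest
    print(item)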
I solved my problem by reading the source code of the Scrapy engine:
def _next_request(self, spider):
    slot = self.slot
    if not slot:
        return

    if self.paused:
        return

    while not self._needs_backout(spider):
        if not self._next_request_from_scheduler(spider):
            break

    if slot.start_requests and not self._needs_backout(spider):
        try:
            request = next(slot.start_requests)
        except StopIteration:
            slot.start_requests = None
        except Exception:
            slot.start_requests = None
            logger.error('Error while obtaining start requests',
                         exc_info=True, extra={'spider': spider})
        else:
            self.crawl(request, spider)

    if self.spider_is_idle(spider) and slot.close_if_idle:
        self._spider_idle(spider)
Here Scrapy always tries to fetch requests from the scheduler's queue first, rather than from start_requests. More importantly, Scrapy never puts all of the requests from start_requests into the queue up front; it pulls at most one of them per pass through _next_request.
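To see the consequence of that loop more concretely, here is a toy simulation in plain Python (this is not Scrapy code; the _needs_backout and concurrency checks are deliberately ignored): each pass drains the queue first and only then takes a single request from start_requests, so requests generated by callbacks keep getting ahead of the remaining seeds.

from collections import deque

def simulate(seeds, passes=4):
    scheduler = deque()
    start_requests = iter(seeds)
    for n in range(passes):
        # phase 1: like _next_request_from_scheduler(), empty the queue first
        while scheduler:
            req = scheduler.popleft()
            print('pass %d: downloaded %s' % (n, req))
            if '->' not in req:                      # pretend each seed page's
                scheduler.append(req + '->follow')   # callback yields one follow-up
        # phase 2: only now pull a single request from start_requests
        try:
            seed = next(start_requests)
        except StopIteration:
            print('pass %d: start_requests exhausted' % n)
            break
        print('pass %d: scheduled start request %s' % (n, seed))
        scheduler.append(seed)

simulate(['seed%d' % i for i in range(5)])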
So I changed my code like this:
def start_requests(self):
    db = SeedUserGenerator()
    result = db.selectSeedUsers()
    db.closeDB()
    urls = []
    for name in result:
        urls.append(self.user_info_url.format(name))
    yield Request(url=urls[0], callback=self.parse_temp, dont_filter=True,
                  priority=10, meta={'urls': urls})

def parse_temp(self, response):
    urls = response.meta['urls']
    for url in urls:
        print(url)
        yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
    print('fin2')
Then Scrapy puts all of the requests into the queue first: every Request yielded from parse_temp is handed to the scheduler as that callback's output is consumed, instead of one per engine pass as with start_requests.
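For reference, here is a self-contained sketch of the same workaround that can be run on its own with scrapy runspider; the SeedSpider name and the example.com URLs are placeholders standing in for the original database lookup, not part of the original project.

import scrapy
from scrapy import Request


class SeedSpider(scrapy.Spider):
    name = 'seed_spider'
    # Hypothetical seed URLs standing in for db.selectSeedUsers().
    seed_urls = ['https://example.com/user/%d' % i for i in range(10)]

    def start_requests(self):
        # Only one bootstrap request is yielded here; the full seed list
        # travels along in meta.
        yield Request(url=self.seed_urls[0], callback=self.parse_temp,
                      dont_filter=True, priority=10,
                      meta={'urls': self.seed_urls})

    def parse_temp(self, response):
        # Every Request yielded here reaches the scheduler as this callback's
        # output is consumed, so the whole batch is queued up quickly instead
        # of one request per engine pass.
        for url in response.meta['urls']:
            yield Request(url=url, callback=self.parse_user,
                          dont_filter=False, priority=10)
        print('fin2')

    def parse_user(self, response):
        self.logger.info('got %s', response.url)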