Scrapy start_requests() didn't yield all requests

    def start_requests(self):
        db = SeedUserGenerator()
        result = db.selectSeedUsers()
        db.closeDB()
        urls = []
        for name in result:
            urls.append(self.user_info_url.format(name))
        for url in urls:
            yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
        print('fin')  # expected to print once all requests have been yielded

    def parse_user(self, response):
        # ... some code omitted ...
        yield Request(url=next_url, priority=20, callback=self.parse_info)

    def parse_info(self, response):
        # ... some code omitted ...
        yield Request(url=next_url, priority=30, callback=self.parse_user)

The program runs like this:

  1. A few requests are yielded from start_requests, and then start_requests seems to pause without ever printing the string fin.
  2. A response arrives and parse_user yields another Request, but the remaining Requests in start_requests can only be yielded after that response has been processed; the yield operations form a ring here (see the sketch after this list).
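
To see the underlying Python behaviour, here is a minimal plain-Python sketch (not Scrapy code): a generator body runs only up to each yield, so the code after the loop (like the print('fin') above) runs only once the generator is fully consumed.

    def start_requests():
        for i in range(3):
            yield 'request-{}'.format(i)
        print('fin')  # reached only once the generator is exhausted

    gen = start_requests()
    print(next(gen))  # request-0 -- the body pauses right after this yield
    print(next(gen))  # request-1
    # 'fin' never prints unless next() keeps being called until StopIteration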

It looks synchronous: before a request from start_requests has been sent and its response processed, no other requests can be yielded?

Does this mean Scrapy can never yield the remaining requests in start_requests?

How can I make Scrapy finish running start_requests first?

I'm new to Python and just getting started with Scrapy. Can Scrapy process responses and yield requests at the same time?

By the way, I'm using Python 3.6 and Scrapy 1.5.1 with Twisted 20.3.0.

I solved my problem by reading the source code of Scrapy's engine:

    def _next_request(self, spider):
        slot = self.slot
        if not slot:
            return

        if self.paused:
            return

        # first, drain whatever is already in the scheduler's queue...
        while not self._needs_backout(spider):
            if not self._next_request_from_scheduler(spider):
                break

        # ...and only then pull at most ONE request from start_requests
        if slot.start_requests and not self._needs_backout(spider):
            try:
                request = next(slot.start_requests)
            except StopIteration:
                slot.start_requests = None
            except Exception:
                slot.start_requests = None
                logger.error('Error while obtaining start requests',
                             exc_info=True, extra={'spider': spider})
            else:
                self.crawl(request, spider)

        if self.spider_is_idle(spider) and slot.close_if_idle:
            self._spider_idle(spider)

Here, Scrapy always tries to fetch requests from the scheduler's queue first, before touching start_requests.

More importantly, Scrapy never puts all the requests from start_requests into the queue up front: each pass through _next_request takes at most one request from the start_requests generator (see the sketch below).
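
The same logic, restated as a minimal plain-Python model (a hypothetical simplification: deque stands in for Scrapy's scheduler, and handle for downloading a request and running its callback):

    from collections import deque

    def handle(request):
        # stand-in for downloading a request and running its callback
        print('downloading', request)

    def engine_tick(scheduler, start_requests):
        # 1. drain everything already sitting in the scheduler queue
        while scheduler:
            handle(scheduler.popleft())
        # 2. only then pull at most ONE request from start_requests
        try:
            scheduler.append(next(start_requests))
        except StopIteration:
            pass

    queue = deque()
    seeds = iter(['user/1', 'user/2', 'user/3'])
    engine_tick(queue, seeds)  # queue is empty, so only 'user/1' is pulled in
    engine_tick(queue, seeds)  # downloads 'user/1', then pulls 'user/2'

If a callback enqueues new requests while the queue is being drained, they are all processed before the next seed is pulled, which is exactly the ring observed above.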

So I changed my code like this:

    def start_requests(self):
        db = SeedUserGenerator()
        result = db.selectSeedUsers()
        db.closeDB()
        urls = []
        for name in result:
            urls.append(self.user_info_url.format(name))
        # yield a single bootstrap request; the full URL list travels along in meta
        yield Request(url=urls[0], callback=self.parse_temp, dont_filter=True, priority=10, meta={'urls': urls})

    def parse_temp(self, response):
        urls = response.meta['urls']
        # fan out: every request yielded here goes straight into the scheduler queue
        for url in urls:
            print(url)
            yield Request(url=url, callback=self.parse_user, dont_filter=False, priority=10)
        print('fin2')  # prints once all requests have been handed to the scheduler

Now Scrapy puts all the requests into the scheduler's queue first, before processing any of their responses.
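
As a closing note: request priority only orders requests that are already in the scheduler's queue. Requests still sitting inside the start_requests generator are invisible to the scheduler, which is why raising priority alone could not fix the original code.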