Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice

I have some code that looks like this:

from scrapy.crawler import CrawlerProcess

def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()  # blocks until the crawl finishes
    return True

I have two py.test tests, each of which calls run(), and when the second test executes I get the following error:

    runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
    reactor.run(installSignalHandlers=False)  # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
    ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>

    def startRunning(self):
        """
            Method called when reactor starts: do some initialization and fire
            startup events.

            Don't call this directly, call reactor.run() instead: it should take
            care of calling this.

            This method is somewhat misnamed.  The reactor will not necessarily be
            in the running state by the time this method returns.  The only
            guarantee is that it will be on its way to the running state.
            """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E           twisted.internet.error.ReactorNotRestartable

I understand the reactor is already running, so I can't call runner.start() again when the second test runs. But is there some way to reset its state between tests, so they are more isolated and can actually run one after another?

According to the scrapy docs:

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.

For example:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

If you want to run another spider after process.start has been called, then hopefully your program can determine that need ahead of time and schedule both crawls before starting the reactor.

Examples for other scenarios are given in the docs.
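One of those examples runs spiders sequentially on the same reactor by chaining crawls with CrawlerRunner. A minimal sketch of that pattern (MySpider1 and MySpider2 stand in for your own spiders):

from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # run the spiders one after another on the same reactor
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # blocks here until the last crawl has finished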

If you use CrawlerRunner instead of CrawlerProcess in conjunction with pytest-twisted, you should be able to run your tests like this:

Install the Twisted integration for pytest: pip install pytest-twisted

from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)     # return Deferred


def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred

In plain terms, _run_crawler() schedules a crawl in the Twisted reactor and invokes callbacks when the crawl finishes. In those callbacks (_success() and _error()) you do your assertions. Finally, you must return the Deferred object from _run_crawler() so that the test waits until the crawl is complete. Returning the Deferred is essential and must be done for every test.
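If you prefer keeping the assertions in the test body instead of in callbacks, pytest-twisted also provides an inlineCallbacks decorator; a minimal sketch in that style, using the same MySpider and settings placeholders as above:

import pytest_twisted

@pytest_twisted.inlineCallbacks
def test_scrapy_crawler_inline():
    # the test resumes only after the crawl's Deferred has fired
    yield _run_crawler(MySpider, settings)
    # do your assertions here
    assert True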

Here's an example of how to run multiple crawls and aggregate the results using gatherResults:

from twisted.internet import defer

def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)

    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list
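Note that the Deferred returned by runner.crawl() fires when the spider closes, not with the scraped items themselves. If you want to assert on the items, one option (a hedged sketch, not part of the original answer) is to collect them via the item_scraped signal:

from scrapy import signals
from scrapy.crawler import CrawlerRunner

def _run_crawler_collecting_items(spider_cls, settings):
    """
    Like _run_crawler(), but resolves with the list of scraped items.
    spider_cls and settings are the same placeholders as above.
    """
    items = []
    runner = CrawlerRunner(settings)
    crawler = runner.create_crawler(spider_cls)
    # append every scraped item to the list as it comes in
    crawler.signals.connect(
        lambda item, response, spider: items.append(item),
        signal=signals.item_scraped,
    )
    deferred = runner.crawl(crawler)
    deferred.addCallback(lambda _: items)
    return deferred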

I hope this helps; if it doesn't, please ask about the issue you're running into.