Scrapy raises ReactorNotRestartable when CrawlerProcess is run twice
I have some code that looks like this:
from scrapy.crawler import CrawlerProcess

def run(spider_name, settings):
    runner = CrawlerProcess(settings)
    runner.crawl(spider_name)
    runner.start()
    return True
I have two py.test tests, each of which calls run(), and when the second test executes I get the following error:
runner.start()
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/scrapy/crawler.py:291: in start
reactor.run(installSignalHandlers=False) # blocking call
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1242: in run
self.startRunning(installSignalHandlers=installSignalHandlers)
../../.virtualenvs/scrape-service/lib/python3.6/site-packages/twisted/internet/base.py:1222: in startRunning
ReactorBase.startRunning(self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <twisted.internet.selectreactor.SelectReactor object at 0x10fe21588>
    def startRunning(self):
        """
        Method called when reactor starts: do some initialization and fire
        startup events.

        Don't call this directly, call reactor.run() instead: it should take
        care of calling this.

        This method is somewhat misnamed. The reactor will not necessarily be
        in the running state by the time this method returns. The only
        guarantee is that it will be on its way to the running state.
        """
        if self._started:
            raise error.ReactorAlreadyRunning()
        if self._startedBefore:
>           raise error.ReactorNotRestartable()
E           twisted.internet.error.ReactorNotRestartable
I understand the reactor is already running, so I can't call runner.start() again when the second test runs. But is there some way to reset its state between tests, so that they are more isolated and can actually run one after another?
By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.
For example:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
If you want to run another spider after process.start() has been called, then hopefully you can anticipate that need in your program and schedule the crawl before starting. Examples for other scenarios are given in the docs.
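For reference, the pattern the docs show for running spiders sequentially in the same process is to chain the crawls on a CrawlerRunner and stop the reactor when the last one finishes. A rough sketch along those lines, reusing MySpider1 and MySpider2 from above:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before starting the next
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished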
If you use CrawlerRunner instead of CrawlerProcess in conjunction with pytest-twisted, you should be able to run your tests like this:

Install the Twisted integration for pytest: pip install pytest-twisted
from scrapy.crawler import CrawlerRunner

def _run_crawler(spider_cls, settings):
    """
    spider_cls: Scrapy Spider class
    settings: Scrapy settings
    returns: Twisted Deferred
    """
    runner = CrawlerRunner(settings)
    return runner.crawl(spider_cls)  # return Deferred

def test_scrapy_crawler():
    deferred = _run_crawler(MySpider, settings)

    @deferred.addCallback
    def _success(results):
        """
        After crawler completes, this function will execute.
        Do your assertions in this function.
        """

    @deferred.addErrback
    def _error(failure):
        raise failure.value

    return deferred
Put plainly, _run_crawler() schedules a crawl in the Twisted reactor and executes callbacks when the crawl is done. In those callbacks (_success() and _error()) is where you do your assertions. Finally, you have to return the Deferred object from _run_crawler() so that the test waits until the crawl is finished. This part with the Deferred is essential and must be done for all of your tests.
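As a side note, pytest-twisted also supports generator-based tests via its inlineCallbacks decorator, which can read a bit more naturally than attaching callbacks by hand. A small sketch under that assumption, with MySpider and settings as before:

import pytest_twisted
from scrapy.crawler import CrawlerRunner

@pytest_twisted.inlineCallbacks
def test_scrapy_crawler_inline():
    runner = CrawlerRunner(settings)
    # yield waits for the crawl's Deferred to fire before the test continues
    yield runner.crawl(MySpider)
    # do your assertions here, after the crawl has finished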
Here is an example of how to run multiple crawls and aggregate the results using gatherResults.
from twisted.internet import defer

def test_multiple_crawls():
    d1 = _run_crawler(Spider1, settings)
    d2 = _run_crawler(Spider2, settings)
    d_list = defer.gatherResults([d1, d2])

    @d_list.addCallback
    def _success(results):
        assert True

    @d_list.addErrback
    def _error(failure):
        assert False

    return d_list
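Note that the Deferred returned by CrawlerRunner.crawl() fires with None, so results above will not contain the scraped items. If you want to assert on the actual items, one option is to collect them with the item_scraped signal. A sketch of a hypothetical helper (_run_crawler_collecting_items is not part of Scrapy, just an illustration):

from scrapy import signals
from scrapy.crawler import CrawlerRunner

def _run_crawler_collecting_items(spider_cls, settings):
    """Run a crawl and fire the returned Deferred with the list of scraped items."""
    items = []
    runner = CrawlerRunner(settings)
    crawler = runner.create_crawler(spider_cls)
    # append every scraped item to the list as it is produced
    crawler.signals.connect(
        lambda item, response, spider: items.append(item),
        signal=signals.item_scraped,
    )
    deferred = runner.crawl(crawler)
    deferred.addCallback(lambda _: items)  # hand the collected items to the test
    return deferred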
I hope this helps, and if it doesn't, please ask about the problems you're running into.