Chain Scrapy spiders which have data dependencies in a Twisted reactor
The Scrapy documentation actually explains how to chain two spiders, like this:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
But in my use case, MySpider2 needs information that MySpider1 retrieved, after it has been transformed by some transformFunction(). So I would like something like this:
def transformFunction():
    # ... transform the data retrieved by MySpider1 ...
    return newdata

def crawl():
    yield runner.crawl(MySpider1)
    newdata = transformFunction()
    yield runner.crawl(MySpider2, data=newdata)
    reactor.stop()
What I want to schedule:

1. MySpider1 starts, writes data to disk, then exits
2. transformFunction() transforms data into newdata
3. MySpider2 starts and uses newdata
So how can I manage this behaviour using the Twisted reactor and Scrapy?
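For context: keyword arguments passed to runner.crawl() are forwarded to the spider's constructor, so MySpider2 could receive newdata roughly like this (a minimal sketch; treating data as a list of URLs is just an assumption for illustration):

import scrapy

class MySpider2(scrapy.Spider):
    name = "myspider2"

    def __init__(self, data=None, *args, **kwargs):
        # `data` arrives here via runner.crawl(MySpider2, data=newdata)
        super().__init__(*args, **kwargs)
        self.data = data or []

    def start_requests(self):
        # Assumption for this sketch: newdata is an iterable of URLs
        for url in self.data:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ...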
runner.crawl returns a Deferred, so you can chain callbacks to it. Only minor adjustments to your code are necessary:
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()

def crawl(reactor):
    runner = CrawlerRunner()
    d = runner.crawl(MySpider1)
    d.addCallback(transformFunction)
    d.addCallback(crawl2, runner)
    return d

def transformFunction(result):
    # crawl doesn't usually return any results if successful, so ignore the result var here
    # ...
    return newdata

def crawl2(result, runner):
    # result == newdata from transformFunction
    # runner is passed in from crawl()
    return runner.crawl(MySpider2, data=result)

task.react(crawl)
The main function is crawl(), which is executed by task.react(); task.react() will start and stop the reactor for you. A Deferred is returned from runner.crawl(), and the transformFunction and crawl2 functions are chained to it, so that when one function finishes the next one starts.
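Since MySpider1 writes its data to disk, transformFunction is the natural place to read it back. A minimal sketch, assuming MySpider1 exported its items to items1.json (e.g. via a JSON feed export); the file name and the uppercase transform are purely illustrative:

import json

def transformFunction(result):
    # The Deferred from runner.crawl() fires with no useful result,
    # so ignore `result` and read what MySpider1 wrote to disk.
    # "items1.json" is an assumed feed export path, not a Scrapy default.
    with open("items1.json") as f:
        items = json.load(f)
    # Illustrative transform only: uppercase one assumed field per item
    newdata = [item["title"].upper() for item in items]
    return newdata

Note that if transformFunction itself returns a Deferred (for example, if the transformation is asynchronous), the callback chain still works: Twisted waits for that inner Deferred to fire before calling crawl2.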