Scrapy on a schedule
Trying to get Scrapy to run on a schedule is driving me around the Twist(ed).
I thought the test code below would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time:
from quotesbot.spiders.quotes import QuotesSpider
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_script():
    process.crawl(QuotesSpider)
    process.start()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})

schedule.every(5).seconds.do(run_spider_script)

while True:
    schedule.run_pending()
    time.sleep(1)
My guess is that, as part of CrawlerProcess, the Twisted reactor gets asked to restart when that isn't allowed, and so the program crashes. Is there any way to control this?

Also, at this stage, if there's an alternative way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either:
from quotesbot.spiders.quotes import QuotesSpider
from scrapy import cmdline
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_cmd():
    print("Running spider")
    cmdline.execute("scrapy crawl quotes".split())

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})

schedule.every(5).seconds.do(run_spider_cmd)

while True:
    schedule.run_pending()
    time.sleep(1)
EDIT

Adding code that uses Twisted task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way to schedule a spider that runs at the same time every day?
from twisted.internet import reactor
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()
            print(author, text)

def run_crawl():
    runner = CrawlerRunner()
    runner.crawl(QuotesSpider)

l = task.LoopingCall(run_crawl)
l.start(3)

reactor.run()
First noteworthy point: there is usually only one Twisted reactor running, and it is not restartable (as you've seen). Second: blocking tasks/functions (i.e. time.sleep(n)) should be avoided and replaced with asynchronous alternatives (e.g. task.deferLater(reactor, n, ...)).
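As a minimal illustration of that non-blocking style (do_work here is a hypothetical placeholder function), a delayed call with task.deferLater looks like this:

from twisted.internet import reactor, task

def do_work():
    # hypothetical unit of work; it runs later without ever blocking the reactor
    print("working")

# schedule do_work to run 5 seconds from now; deferLater returns a
# Deferred that fires with do_work's result
d = task.deferLater(reactor, 5, do_work)

reactor.run()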
Using Scrapy effectively from a Twisted project requires the scrapy.crawler.CrawlerRunner core API as opposed to scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess runs Twisted's reactor for you (thus making it difficult to restart the reactor), whereas CrawlerRunner relies on the developer to start the reactor. Here is what your code could look like with CrawlerRunner:
from twisted.internet import reactor
from quotesbot.spiders.quotes import QuotesSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })
    deferred = runner.crawl(QuotesSpider)
    # you can use reactor.callLater or task.deferLater to schedule a function;
    # the crawl's Deferred fires with None, so wrap the rescheduling in a lambda
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl()
reactor.run()  # you have to run the reactor yourself
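If the ultimate goal is a run at the same time each day rather than every 5 seconds, one way to adapt this (a sketch, not a definitive implementation; the 09:30 target time is an arbitrary placeholder) is to compute the delay until the next occurrence of the target time and hand that to reactor.callLater:

from datetime import datetime, timedelta
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from quotesbot.spiders.quotes import QuotesSpider

def seconds_until(hour, minute):
    # seconds from now until the next occurrence of hour:minute
    now = datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()

def run_daily_crawl():
    runner = CrawlerRunner()
    d = runner.crawl(QuotesSpider)
    # once today's crawl finishes, schedule tomorrow's run
    d.addCallback(lambda _: reactor.callLater(seconds_until(9, 30), run_daily_crawl))
    return d

reactor.callLater(seconds_until(9, 30), run_daily_crawl)
reactor.run()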
Alternatively, you can use apscheduler:

pip install apscheduler
# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler
from Demo.spiders.baidu import YourSpider

process = CrawlerProcess(get_project_settings())
scheduler = TwistedScheduler()
scheduler.add_job(process.crawl, 'interval', args=[YourSpider], seconds=10)
scheduler.start()
process.start(False)  # stop_after_crawl=False keeps the reactor alive between jobs
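For the daily-at-a-fixed-time case the question asks about, apscheduler's cron trigger can replace the interval trigger above; the 09:30 time below is just a placeholder assumption:

# run the spider at 09:30 every day instead of every 10 seconds
scheduler.add_job(process.crawl, 'cron', args=[YourSpider], hour=9, minute=30)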