Scrapy on a schedule

Question

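Getting Scrapy to run on a schedule is driving me around the Twist(ed).

I thought the test code below would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time: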

from quotesbot.spiders.quotes import QuotesSpider
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_script():
    process.crawl(QuotesSpider)
    process.start()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_script)

while True:
    schedule.run_pending()
    time.sleep(1)

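My guess is that, as part of CrawlerProcess, the Twisted reactor is being asked to start again when it should not be, so the program crashes. Is there any way I can control this?

At this stage, if there is another way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either: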

from quotesbot.spiders.quotes import QuotesSpider
from scrapy import cmdline
import schedule
import time
from scrapy.crawler import CrawlerProcess


def run_spider_cmd():
    print("Running spider")
    cmdline.execute("scrapy crawl quotes".split())


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_cmd)

while True:
    schedule.run_pending()
    time.sleep(1)

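EDIT

Adding code that uses Twisted's task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way if the goal is a spider that runs at the same time each day?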

from twisted.internet import reactor
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:

            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()

            print(author, text)


def run_crawl():

    runner = CrawlerRunner()
    runner.crawl(QuotesSpider)


l = task.LoopingCall(run_crawl)
l.start(3)

reactor.run()

Answer 1

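Two things are worth noting. First, there is usually only one Twisted reactor running, and it cannot be restarted (as you've discovered). Second, blocking tasks/functions (e.g. time.sleep(n)) should be avoided and replaced with asynchronous alternatives (e.g. twisted.internet.task.deferLater(reactor, n, ...)).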

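To use Scrapy effectively from a Twisted project, you need the scrapy.crawler.CrawlerRunner core API rather than scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess runs Twisted's reactor for you (which makes it difficult to restart the reactor), whereas CrawlerRunner relies on the developer to start the reactor. With CrawlerRunner, your code could look like this: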

from twisted.internet import reactor
from quotesbot.spiders.quotes import QuotesSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        })
    deferred = runner.crawl(QuotesSpider)
    # addCallback passes the crawl result as the first argument, so wrap the
    # call in a lambda; task.deferLater could be used to schedule it as well
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))
    return deferred

run_crawl()
reactor.run()   # you have to run the reactor yourself
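For the "same time each day" case raised in the question, the fixed 5-second delay could be replaced by the number of seconds until the next run. The helper below is a hypothetical sketch (seconds_until is not part of Scrapy or Twisted); it assumes naive local time and ignores daylight-saving shifts.

```python
from datetime import datetime, timedelta


def seconds_until(hour, minute, now=None):
    """Return the seconds from `now` until the next occurrence of hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so aim for tomorrow.
        target += timedelta(days=1)
    return (target - now).total_seconds()


print(seconds_until(3, 0, now=datetime(2024, 1, 1, 2, 30)))  # 1800.0
print(seconds_until(3, 0, now=datetime(2024, 1, 1, 3, 30)))  # 84600.0
```

The result can be passed as the delay, for example reactor.callLater(seconds_until(3, 0), run_crawl), so each finished crawl schedules the next one for 03:00.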

Answer 2

You can use apscheduler:

pip install apscheduler

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from Demo.spiders.baidu import YourSpider

process = CrawlerProcess(get_project_settings())
scheduler = TwistedScheduler()
scheduler.add_job(process.crawl, 'interval', args=[YourSpider], seconds=10)
scheduler.start()
process.start(stop_after_crawl=False)  # keep the reactor running after each crawl finishes

Tags: python, twisted, scrapy, web-scraping