Is there a way to restart a scrapy crawler?
I want to know if there is a way to restart a scrapy crawler. This is what my code looks like:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess

results = set()

class SitemapCrawler(CrawlSpider):
    name = "Crawler"
    start_urls = ['https://www.example.com']
    allowed_domains = ['www.example.com']
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]

    def parse_links(self, response):
        # collect the page URL and every outgoing link
        href = response.xpath('//a/@href').getall()
        results.add(response.url)
        for link in href:
            results.add(link)

def start():
    process = CrawlerProcess()
    process.crawl(SitemapCrawler)
    process.start()
    for link in results:
        print(link)
If I try to call start() twice, it runs once and then gives me this error:

    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I know this is a general question, so I'm not expecting any code; I just want to know how to solve this problem. Thanks in advance.
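CrawlerProcess.start() runs the Twisted reactor, and a Twisted reactor cannot be restarted once it has been stopped; that is what ReactorNotRestartable is telling you. The usual way around it is to use CrawlerRunner instead, which leaves starting and stopping the reactor to you, so the reactor is started exactly once no matter how many crawls you schedule: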
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Spider definition goes here (name, start_urls, parse, ...)
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)

def finished(result):
    # addCallback passes the crawl result as an argument
    print("finished :D")
    reactor.stop()  # stop the reactor so the script can exit

d.addCallback(finished)
reactor.run()  # blocks here until reactor.stop() is called
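With the runner in hand you can also chain several crawls before stopping the reactor, which effectively "restarts" the spider. A minimal, self-contained sketch of that pattern (MySpider, its name, and its URLs are placeholders, not from the original post):

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # placeholder spider: collects outgoing links, like parse_links above
    name = "my_spider"
    start_urls = ['https://www.example.com']

    def parse(self, response):
        for link in response.xpath('//a/@href').getall():
            yield {'link': link}

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider)  # first run
    yield runner.crawl(MySpider)  # second run: no ReactorNotRestartable
    reactor.stop()                # stop the reactor after both runs finish

crawl()
reactor.run()  # blocks until crawl() stops the reactor

Because runner.crawl() returns a Deferred, each yield waits for one crawl to finish before the next starts, while the reactor itself is only ever started once.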