Running multiple spiders in Scrapy for one website in parallel?
I want to crawl a website that has two parts, and my script is not as fast as I need it to be.
Is it possible to launch two spiders, one to scrape the first part and the other for the second part?
I tried having two different classes and running them with

scrapy crawl firstSpider
scrapy crawl secondSpider

but I don't think this is a smart approach.
I read the documentation of scrapyd, but I don't know whether it fits my case.
I think what you are looking for is something like this:
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
You can read more in the documentation: running-multiple-spiders-in-the-same-process.
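That same docs page also covers CrawlerRunner, which is useful if you want the spiders to run one after another instead of in parallel. A minimal sketch, reusing the MySpider1/MySpider2 classes from above (adapted from the sequential-run example in the Scrapy docs):

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner

runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)  # waits until the first spider finishes
    yield runner.crawl(MySpider2)  # ...then starts the second
    reactor.stop()

crawl()
reactor.run()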
Or you can run them like this; you need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

for spider_name in process.spiders.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name, query="dvh")  # "dvh" is a custom argument passed to your spider
process.start()
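For context, any keyword arguments you pass to process.crawl() are forwarded to the spider's constructor. A minimal sketch of a spider that accepts the query argument (the class and name below are illustrative, not from the original answer):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "firstSpider"

    def __init__(self, query=None, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        self.query = query  # "dvh" in the example above; use it when building requests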
A better solution (if you have multiple spiders) is to fetch the spiders dynamically and run them.
from scrapy import spiderloader
from scrapy.crawler import CrawlerRunner
from scrapy.utils import project
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

settings = project.get_project_settings()
runner = CrawlerRunner(settings)  # the original snippet used `runner` and `reactor` without defining them

@inlineCallbacks
def crawl():
    spider_loader = spiderloader.SpiderLoader.from_settings(settings)
    spiders = spider_loader.list()
    classes = [spider_loader.load(name) for name in spiders]
    for my_spider in classes:
        yield runner.crawl(my_spider)
    reactor.stop()

crawl()
reactor.run()
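Note that CrawlerRunner, unlike CrawlerProcess, does not set up or manage the Twisted reactor for you, which is why this snippet has to call reactor.run() itself and stop the reactor once all the crawls have finished.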
(Second solution): since spiders.list() was deprecated in Scrapy 1.4, Yuda's solution should be converted into something like:
from scrapy import spiderloader
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

settings = get_project_settings()
process = CrawlerProcess(settings)
spider_loader = spiderloader.SpiderLoader.from_settings(settings)

for spider_name in spider_loader.list():
    print("Running spider %s" % (spider_name))
    process.crawl(spider_name)
process.start()
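As with the earlier snippets, save this next to scrapy.cfg and run it with python; CrawlerProcess schedules every listed spider in the same Twisted reactor, so they all crawl in parallel inside a single process.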