Scrapy multiprocessing
I am trying to build a scraper (using Scrapy) that launches spiders from main.py with multiprocessing.
The first spider (cat_1) is launched without multiprocessing, using scrapy.crawler.CrawlerProcess:
crawler_settings = Settings()
crawler_settings.setmodule(default_settings)
runner = CrawlerProcess(settings=crawler_settings)
runner.crawl(cat_1)
runner.start(stop_after_crawl=True)
It works fine, and I get all the data exported by the FEED.
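For context, the feed export that produces the cat_1 results is presumably configured in myproject/settings.py roughly as follows; the file name and options are assumptions, chosen to match the JSON-lines loading done in the answer further down.

## myproject/settings.py -- assumed feed export configuration
FEEDS = {
    'cat_1.json': {
        'format': 'jsonlines',  # one JSON object per line
        'encoding': 'utf8',
        'overwrite': True,
    },
}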
The next spiders need the results of the first one and are run with multiprocessing:
After loading the results of the first spider, I build a list of URLs and pass it to my function process_cat_2(). This function creates processes, and each one launches the spider cat_2:
from multiprocessing import Process

def launch_crawler_cat_2(crawler, url):
    cat_name = url[0]
    cat_url = url[1]
    runner.crawl(crawler, cat_name, cat_url)

def process_cat_2(url_list):
    nb_spiders = len(url_list)
    list_process = [None] * nb_spiders

    while url_list:
        for i in range(nb_spiders):
            if not (list_process[i] and list_process[i].is_alive()):
                list_process[i] = Process(target=launch_crawler_cat_2, args=(cat_2, url_list.pop(0)))
                list_process[i].start()
                # break

    # Wait for all processes to finish
    for process in list_process:
        if process:
            # process.start()
            process.join()
The problem is that runner.crawl(crawler, cat_name, cat_url) (for cat_2) does not crawl anything:
2021-10-07 17:20:38 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
And I don't know how to reuse the existing twisted.internet.reactor so as to avoid this error:
twisted.internet.error.ReactorNotRestartable
which is raised when using:
def launch_crawler_cat_2(crawler, url):
    cat_name = url[0]
    cat_url = url[1]
    runner.crawl(crawler, cat_name, cat_url)
    runner.start()
How can I launch new spiders using the existing reactor object?
Here is a solution for people in the same situation as me. I was able to run multiple spiders, where some of them need the results of previous ones and some of them run with multiprocessing.
Initialise each Crawler in a separate process (that way each child process runs its own Twisted reactor, which avoids the ReactorNotRestartable problem):
import sys
import json
import pandas as pd
from multiprocessing import Process

## Scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

## Spiders & settings
from myproject.spiders.cat_1 import cat_1
from myproject.spiders.cat_2 import cat_2
from myproject import settings as default_settings

## Init crawler
crawler_settings = Settings()
crawler_settings.setmodule(default_settings)
# runner = CrawlerRunner(settings=default_settings)
runner = CrawlerProcess(settings=crawler_settings)

def launch_crawler_cat_2(crawler, url):
    process = CrawlerProcess(crawler_settings)
    process.crawl(crawler, url[0], url[1])
    process.start(stop_after_crawl=True)

def process_cat_2(url_list):
    nb_spiders = 5
    list_process = [None] * nb_spiders

    while url_list:
        for i in range(nb_spiders):
            if not (list_process[i] and list_process[i].is_alive()):
                list_process[i] = Process(target=launch_crawler_cat_2, args=(cat_2, url_list.pop(0)))
                list_process[i].start()
                break

    # Wait for all processes to finish
    for process in list_process:
        if process:
            process.join()

def crawl_cat_1():
    process = CrawlerProcess(crawler_settings)
    process.crawl(cat_1)
    process.start(stop_after_crawl=True)

if __name__ == "__main__":

    ## Scrape cat_1
    process_cat_1 = Process(target=crawl_cat_1)
    process_cat_1.start()
    process_cat_1.join()

    ##########################################################################
    ########## LOAD cat_1 RESULTS
    try:
        with open('./cat_1.json', 'r+', encoding="utf-8") as f:
            lines = f.readlines()
            lines = [json.loads(line) for line in lines]
            df_cat_1 = pd.DataFrame(lines)
    except:
        df_cat_1 = pd.DataFrame([])

    print(df_cat_1)
    if df_cat_1.empty:
        sys.exit('df_cat_1 empty DataFrame')

    df_cat_1['cat_1_tuple'] = list(zip(df_cat_1.cat_name, df_cat_1.cat_url))
    df_cat_1_tuple_list = df_cat_1.cat_1_tuple.tolist()
    process_cat_2(df_cat_1_tuple_list)
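For reference, url[0] and url[1] are passed to process.crawl() as positional arguments, and Scrapy forwards them to the spider's constructor. So cat_2 presumably accepts them along these lines; the attribute names and parsing logic here are assumptions, not the actual spider.

import scrapy

class cat_2(scrapy.Spider):
    name = 'cat_2'

    def __init__(self, cat_name=None, cat_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The two positional arguments from crawl(cat_2, cat_name, cat_url) land here.
        self.cat_name = cat_name
        self.start_urls = [cat_url] if cat_url else []

    def parse(self, response):
        # Placeholder parsing; the real spider extracts the category's items.
        yield {'cat_name': self.cat_name, 'url': response.url}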
Well.. I found a solution for running multiple spiders, multiple times, by using CrawlerRunner as recommended in the docs: https://docs.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
There’s another Scrapy utility that provides more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper that encapsulates some simple helpers to run multiple crawlers, but it won’t start or interfere with existing reactors in any way.
Using this class the reactor should be explicitly run after scheduling your spiders. It’s recommended you use CrawlerRunner instead of CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.
Here is my solution:
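A minimal sketch of that CrawlerRunner pattern, adapted from the chaining example in the linked documentation; the sequential chaining, the JSON-lines reload of cat_1's feed, and the field names are assumptions rather than the exact production code.

import json
from twisted.internet import defer, reactor
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.log import configure_logging

from myproject.spiders.cat_1 import cat_1
from myproject.spiders.cat_2 import cat_2
from myproject import settings as default_settings

crawler_settings = Settings()
crawler_settings.setmodule(default_settings)

configure_logging()
runner = CrawlerRunner(settings=crawler_settings)

@defer.inlineCallbacks
def crawl():
    # cat_1 runs first; its feed export writes cat_1.json (JSON lines).
    yield runner.crawl(cat_1)
    # Rebuild the (cat_name, cat_url) list from that file, then run the
    # cat_2 crawls one after another inside the same, single reactor.
    with open('cat_1.json', encoding='utf-8') as f:
        url_list = [(d['cat_name'], d['cat_url']) for d in map(json.loads, f)]
    for cat_name, cat_url in url_list:
        yield runner.crawl(cat_2, cat_name, cat_url)
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until all the crawls above finish

Because the reactor is started exactly once and never restarted, ReactorNotRestartable does not come up; the trade-off compared with the multiprocessing version is that all the cat_2 crawls share a single process.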