Scrapy: Run code before loading settings.py

Question

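I have a web scraper that uses proxies. A script generates a list of 100 working proxies, and I set that list as the proxy source in settings.py. My problem is that I currently run the list-generating script by hand and then run the spider.

Does anyone know where I should put the code if I want it to run before settings.py is "processed"? I don't want to have to run the script manually before running the spider, because I want the project to be self-contained.

ROTATING_PROXY_LIST_PATH = r'C:\Users\cmdan\Desktop\Spiders\Michael Mitarotonda\proxies.txt'

Thanks in advance!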

Answer 1

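The Scrapy documentation describes this approach under "Run Scrapy from a script". Because you start the crawl from your own Python script, you can do other things first, such as running your proxy script, before the spider starts.

You can either define your spider in this script or import it from your project; both work.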

import scrapy
from scrapy.crawler import CrawlerProcess

# if you want to import your spider
# from project.spiders import myspider

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

# here comes your script, setting the value of
# ROTATING_PROXY_LIST_PATH

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
    "ROTATING_PROXY_LIST_PATH": "path-to-file",
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
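
If you would rather keep the rest of your project's settings.py instead of passing a settings dict, a variation along these lines might work. This is only a sketch under some assumptions: `generate_proxies` is a hypothetical stand-in for your existing proxy script, `write_proxy_file` and `run_crawl` are illustrative helper names, and the script is assumed to be run from inside the Scrapy project directory so that `get_project_settings()` can find settings.py.

```python
from pathlib import Path


def generate_proxies():
    """Hypothetical stand-in for the existing proxy-generation script.

    Replace the body with the real logic; it should return a list of
    "host:port" strings.
    """
    return ["127.0.0.1:8080", "127.0.0.1:8081"]  # placeholder values


def write_proxy_file(proxies, path):
    """Write one proxy per line and return the file path as a string."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(proxies) + "\n", encoding="utf-8")
    return str(path)


def run_crawl(spider_cls, proxy_path="proxies.txt"):
    """Refresh the proxy list, then start the crawl with project settings."""
    # Imported here so the helpers above can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    list_path = write_proxy_file(generate_proxies(), proxy_path)

    settings = get_project_settings()  # loads the project's settings.py
    settings.set("ROTATING_PROXY_LIST_PATH", list_path)  # override at runtime

    process = CrawlerProcess(settings)
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl finishes
```

Calling `run_crawl(MySpider)` would then regenerate the proxy file and start the spider in one step, so nothing has to be run by hand beforehand.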
