
Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script

I want to fetch data with Scrapy from a few different sites and perform some analysis on that data. Since both the crawlers and the code to analyze the data relate to the same project, I'd like to store everything in the same Git repository. I created a minimal reproducible example on GitHub.

The structure of the project looks like this:

./crawlers
./crawlers/__init__.py
./crawlers/myproject
./crawlers/myproject/__init__.py
./crawlers/myproject/myproject
./crawlers/myproject/myproject/__init__.py
./crawlers/myproject/myproject/items.py
./crawlers/myproject/myproject/pipelines.py
./crawlers/myproject/myproject/settings.py
./crawlers/myproject/myproject/spiders
./crawlers/myproject/myproject/spiders/__init__.py
./crawlers/myproject/myproject/spiders/example.py
./crawlers/myproject/scrapy.cfg
./scrapyScript.py

From the ./crawlers/myproject folder, I can execute the spider by typing:

scrapy crawl example

The spider uses some downloader middleware, specifically alecxe's excellent scrapy-fake-useragent. From settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

When executed with scrapy crawl ..., the user agent looks like a real browser. Here is a sample record from the web server:

24.8.42.44 - - [16/Jun/2015:05:07:59 +0000] "GET / HTTP/1.1" 200 27161 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"

Looking at the documentation, it's possible to run the equivalent of scrapy crawl ... from a script. Based on the documentation, the scrapyScript.py file looks like this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals

from scrapy.utils.project import get_project_settings
from crawlers.myproject.myproject.spiders.example import ExampleSpider

spider = ExampleSpider()
settings = get_project_settings()

crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)

crawler.start()
log.start()
reactor.run()

When I execute the script, I can see the crawler requesting pages. Unfortunately, it ignores DOWNLOADER_MIDDLEWARES. For example, the user agent is no longer spoofed:

24.8.42.44 - - [16/Jun/2015:05:32:04 +0000] "GET / HTTP/1.1" 200 27161 "-" "Scrapy/0.24.6 (+http://scrapy.org)"

Somehow, when the crawler is executed from a script, it seems to ignore the settings in settings.py.

Can you see what I'm doing wrong?

For get_project_settings() to find the desired settings.py, set the SCRAPY_SETTINGS_MODULE environment variable:

import os
import sys

# ...

# Make the inner project directory (the one containing scrapy.cfg) importable
sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject"))
# Tell Scrapy which settings module to load
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

settings = get_project_settings()

Note that, because of where the script is run from, you need to add the myproject directory to sys.path. Alternatively, move scrapyScript.py into the myproject directory.
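To see why this works, it helps to know that get_project_settings() essentially imports the module named by SCRAPY_SETTINGS_MODULE and reads its uppercase attributes. The sketch below is a simplified illustration of that mechanism, not Scrapy's actual implementation; it builds a throwaway myproject package in a temp directory so it can run standalone:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package that mimics crawlers/myproject/myproject/settings.py.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "myproject")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "settings.py"), "w") as f:
    f.write(
        "DOWNLOADER_MIDDLEWARES = "
        "{'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400}\n"
    )

# This mirrors what the fix does: put the project directory on sys.path
# and name the settings module via the environment variable.
sys.path.append(tmp)
os.environ["SCRAPY_SETTINGS_MODULE"] = "myproject.settings"

# Roughly what get_project_settings() does internally: import the module
# named by SCRAPY_SETTINGS_MODULE and read its UPPERCASE attributes.
mod = importlib.import_module(os.environ["SCRAPY_SETTINGS_MODULE"])
print(mod.DOWNLOADER_MIDDLEWARES)
# -> {'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400}
```

Without the sys.path entry the import in the last step raises ModuleNotFoundError, which is exactly why the spoofed user agent silently disappeared: Scrapy fell back to its default settings.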