middlewares.py in Scrapy is not executed as expected while trying to use multiple user-agents
I am trying to use multiple user-agents in my Scrapy project. I found this script for middlewares.py here:
import random
import logging

from myScrape.settings import USER_AGENT_LIST

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        print('ua = %s' % ua)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # check which ua is used
            logging.debug(u'\n>>>>> User-Agent: %s\n' % request.headers)
And USER_AGENT_LIST is defined in settings.py:
USER_AGENT_LIST = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) '
    'Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 '
    '(KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
]
DOWNLOADER_MIDDLEWARES = {
    'myScrape.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
    # Disable compression middleware, so the actual HTML pages are cached
}
But it does not work as I expected: I still see the Scrapy user agent while crawling. The print call in middlewares.py runs and shows the right ua, but the log output shows the Scrapy agent.
How is this supposed to work? Do I need to call it from my spider script somehow?
As eLRuLL pointed out, it was a typo: I had missed the 's' in downloadermiddlewares, which is needed for the correct path to UserAgentMiddleware.
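For reference, the working setting differs only by that one character (the module name is downloadermiddlewares, plural):

DOWNLOADER_MIDDLEWARES = {
    'myScrape.middlewares.RandomUserAgentMiddleware': 400,
    # note the 's': scrapy.downloadermiddlewares, not scrapy.downloadermiddleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

With the correct path, the built-in UserAgentMiddleware is actually disabled, so it no longer replaces the header set by RandomUserAgentMiddleware.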
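If anyone wants to confirm which header was actually sent, here is a minimal sketch (the spider name and target URL are illustrative, not from the original project): every response keeps a reference to the request that produced it, so the outgoing User-Agent can be logged from the spider itself.

import scrapy

class CheckUASpider(scrapy.Spider):
    # hypothetical spider, only for verifying the middleware
    name = 'check_ua'
    start_urls = ['https://httpbin.org/headers']

    def parse(self, response):
        # request headers are stored as raw bytes in Scrapy
        ua = response.request.headers.get('User-Agent')
        self.logger.debug('Sent User-Agent: %s', ua)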