Why is Scrapy unable to get my URL once I configure my NTLM middleware?
I tried scraping yahoo.com before enabling my NTLM downloader middleware and it worked fine. However, now that the downloader middleware is enabled in the settings, I get an "ERROR: Error downloading" message.
settings.py
BOT_NAME = 'demo'

SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

DOWNLOADER_MIDDLEWARES = {
    'demo.ntlmauth.NtlmAuthMiddleware': 800,
}

ITEM_PIPELINES = [
    'scrapysolr.SolrPipeline',
]

SOLR_URL = 'solr_url'

SOLR_MAPPING = {
    'id': 'url',
    'text': ['title', 'breadcrumbs', 'description'],
    'description': 'description',
    'keywords': 'breadcrumbs',
    'price': 'price',
    'title': 'title',
}
ntlmauth.py. This code can also be found here.
import os
import urllib2

from ntlm import HTTPNtlmAuthHandler
from scrapy.http import HtmlResponse


class NtlmAuthMiddleware(object):
    def process_request(self, request, spider):
        # Build the 'domain\user' credentials from the environment and the spider.
        usr = '%s\\%s' % (os.environ["USERDOMAIN"], getattr(spider, 'http_user', ''))
        pwd = getattr(spider, 'http_pass', '')
        url = request.url

        passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
        passman.add_password(None, url, usr, pwd)

        # Create the NTLM authentication handler.
        auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)

        # Create and install the opener.
        opener = urllib2.build_opener(auth_NTLM)
        urllib2.install_opener(opener)

        # Retrieve the result with urllib2 and hand it back to Scrapy.
        resp = urllib2.urlopen(url)
        msg = resp.info()
        return HtmlResponse(url=url, status=resp.getcode(),
                            headers=msg.items(), body=resp.read())
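For context, this middleware relies on Scrapy's downloader-middleware contract: when process_request returns a Response object, Scrapy skips its own downloader and hands that response straight to the spider. A minimal sketch of that contract (the class name here is illustrative, not part of the project above):

class PassThroughMiddleware(object):
    def process_request(self, request, spider):
        # Returning None lets the request continue through the remaining
        # middlewares and Scrapy's built-in downloader; returning a
        # Response, as NtlmAuthMiddleware does above, short-circuits the
        # download and the Response goes directly to the spider.
        return None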
demo_Spider.py
import scrapy


class DemoSpider(scrapy.Spider):
    http_user = 'DOMAIN\\USER'
    http_pass = 'PASSWORD'
    name = "demo"
    allowed_domains = ["yahoo.com"]
    start_urls = ["https://www.yahoo.com/"]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
The error I ran into can be seen here.
Take a look at the line in the NTLM middleware that builds the username:

usr = '%s\\%s' % (os.environ["USERDOMAIN"], getattr(spider, 'http_user', ''))

The error occurs because the environment variable USERDOMAIN is not set, so os.environ["USERDOMAIN"] raises a KeyError before the request is ever made.

Even if it were set, with your current code the value of usr would come out as 'OsUserDomain\DOMAIN\USER', which is probably not what you want (it is not a meaningful login). I suggest changing either the spider or the middleware so that usr ends up in the proper 'domain\user' format.
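For illustration, one way to implement that suggestion is to keep the full 'domain\user' string on the spider and only fall back to the OS domain when it is missing. A minimal sketch, in Python 2 to match the code above (build_ntlm_user is a hypothetical helper, not part of the original middleware):

import os

def build_ntlm_user(spider):
    # Hypothetical helper for NtlmAuthMiddleware.process_request.
    # If the spider already supplies 'DOMAIN\\USER', use it as-is.
    user = getattr(spider, 'http_user', '')
    if '\\' in user:
        return user
    # Otherwise fall back to the machine's domain; os.environ.get avoids
    # the KeyError when USERDOMAIN is not set (it is a Windows variable).
    return '%s\\%s' % (os.environ.get('USERDOMAIN', ''), user)

In the middleware, usr = build_ntlm_user(spider) would then replace the original expression.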