Running scrapy splash with proxies

I am using a proxy with scrapy splash, but I keep getting 502 proxy errors, and this has been bothering me for days.

My downloader middleware:

import base64

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class ABProxyMiddleware(HttpProxyMiddleware):
    """Abuyun IP proxy configuration."""
    # `settings` is the project settings object (its import is not shown here)
    proxyAuth = "Basic " + base64.urlsafe_b64encode(
        bytes((settings['PROXY_USER'] + ":" + settings['PROXY_PASS']), "ascii")).decode("utf-8")

    def process_request(self, request, spider):
        request.meta['splash']['args']['proxy'] = settings['PROXY_SERVER']
        request.headers['Proxy-Authorization'] = self.proxyAuth

My request:

yield SplashRequest(
    url='http://www.qidian.com/all?chanId=4&subCateId=130&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=' + str(i),
    callback=self.book_parse,
    endpoint='render.html',
)

My settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'tempScrapy.middlewares.ABProxyMiddleware': 100,
}

I am sure that all the proxy settings are correct and that the proxy itself works, because it succeeds without any problems.

Judging from your code, you are sending the proxy authentication header to the Splash server:

+-------------+
| Your spider |
+------+------+
       |
       | Proxy Authentication
       v
+------+-------+
|   Splash     |
+------+-------+
       |
       |
       v
+------+-------+
| Proxy server |
+------+-------+
       |
       |
       v
+------+-------+
| Target site  |
+--------------+

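The Splash server simply ignores the proxy authentication header you send, so the proxy server rejects your request because authentication fails.

The correct approach is to have Splash send the proxy authentication header: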

+-------------+
| Your spider |
+------+------+
       |
       |
       v
+------+-------+
|   Splash     |
+------+-------+
       |
       | Proxy Authentication
       v
+------+-------+
| Proxy server |
+------+-------+
       |
       |
       v
+------+-------+
| Target site  |
+--------------+

So you need to remove this line:

request.headers['Proxy-Authorization'] = self.proxyAuth

And configure the proxy information properly, with the credentials included:

request.meta['splash']['args']['proxy'] = 'proxy info of format: [protocol://][user:password@]proxyhost[:port]'
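
Putting the two changes together, a corrected middleware might look roughly like the sketch below. It is only an illustration under stated assumptions: the helper `build_splash_proxy_url` and the class name `SplashProxyMiddleware` are hypothetical, the setting names `PROXY_SERVER`, `PROXY_USER` and `PROXY_PASS` are taken from the question, and credentials containing characters such as `@` or `:` are not handled. The middleware only fills in the `proxy` argument that Splash receives and sends no `Proxy-Authorization` header itself.

```python
from urllib.parse import urlsplit


def build_splash_proxy_url(server, user, password):
    """Return a proxy URL of the form protocol://user:password@host[:port].

    Hypothetical helper: `server` may be given with or without a scheme;
    "http" is assumed when the scheme is missing.
    """
    parts = urlsplit(server if "://" in server else "http://" + server)
    port = ":%d" % parts.port if parts.port else ""
    return "%s://%s:%s@%s%s" % (parts.scheme, user, password, parts.hostname, port)


class SplashProxyMiddleware:
    """Pass the proxy, credentials included, to Splash via the `proxy` argument."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # Setting names follow the question; adjust them to your project.
        s = crawler.settings
        return cls(build_splash_proxy_url(
            s['PROXY_SERVER'], s['PROXY_USER'], s['PROXY_PASS']))

    def process_request(self, request, spider):
        splash = request.meta.get('splash')
        if splash is None:
            # Not a Splash request: leave it untouched.
            return None
        # Splash authenticates against the proxy using this URL,
        # so no Proxy-Authorization header is set on the request.
        splash.setdefault('args', {})['proxy'] = self.proxy_url
        return None
```

As in the question's settings, the middleware would need a priority below 725 so that it runs before `scrapy_splash.SplashMiddleware` turns the request into a call to the Splash HTTP API.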

See also: API reference of Splash (look for the proxy parameter)