Using a proxy with scrapy-splash

Question

I am trying to use a proxy (proxymesh) with scrapy-splash. I have the following (relevant) code:

PROXY = """splash:on_request(function(request)
    request:set_proxy{
        host = http://us-ny.proxymesh.com,
        port = 31280,
        username = username,
        password = secretpass,
    }
    return splash:html()
end)"""

and in start_requests:

def start_requests(self):
    for url in self.start_urls:
        print(url)
        yield SplashRequest(url, self.parse,
            endpoint='execute',
            args={'wait': 5,
                  'lua_source': PROXY,
                  'js_source': 'document.body'})

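But it does not seem to work: self.parse is never called. If I change the endpoint to 'render.html', self.parse is called, but when I inspect the headers (response.headers) I can see that the request did not go through the proxy. I confirmed this by setting http://checkip.dyndns.org/ as the start URL and seeing my old IP address when parsing the response.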
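The IP check described above can be scripted rather than read by eye. This is a minimal sketch, assuming the response body contains the caller's IPv4 address somewhere in its text; the helper name extract_ip is made up for illustration and is not part of scrapy or scrapy-splash:

```python
import re

# Matches the first thing that looks like a dotted IPv4 address.
IPV4_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def extract_ip(body):
    """Return the first IPv4-looking string in body, or None if there is none."""
    match = IPV4_RE.search(body)
    return match.group(0) if match else None


# In the spider callback, something like:
#     self.logger.info('IP seen by server: %s', extract_ip(response.text))
```

If the logged address is your own IP rather than the proxy's, the request did not go through the proxy.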

What am I doing wrong?

Answer 1

You should add a 'proxy' argument to the args of the SplashRequest object.

def start_requests(self):
    for url in self.start_urls:
        print(url)
        yield SplashRequest(url, self.parse,
            endpoint='execute',
            args={'wait': 5,
                  'lua_source': PROXY,
                  'js_source': 'document.body',
                  # Replace with the real proxy address, e.g. the proxymesh host and port.
                  'proxy': 'http://proxy_ip:proxy_port'})
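Separately from the 'proxy' argument, the Lua script in the question would likely fail on the 'execute' endpoint as written. Splash expects the script to define a main function, and the host, username, and password values need to be Lua strings rather than bare identifiers (the host presumably without the http:// scheme). The following is a minimal sketch under those assumptions, not a verified fix. The helper name build_splash_args and the proxy_* argument names are made up for illustration, and the credentials are the placeholders from the question.

```python
from urllib.parse import urlsplit

# Lua script for Splash's 'execute' endpoint. It reads the proxy settings
# from the request arguments (splash.args) instead of hard-coding them.
LUA_PROXY_SCRIPT = """
function main(splash)
    local args = splash.args
    splash:on_request(function(request)
        request:set_proxy{
            host = args.proxy_host,
            port = args.proxy_port,
            username = args.proxy_username,
            password = args.proxy_password,
        }
    end)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return splash:html()
end
"""


def build_splash_args(proxy_url, wait=5):
    """Split a proxy URL into the (hypothetical) proxy_* args the script reads."""
    parts = urlsplit(proxy_url)
    return {
        'lua_source': LUA_PROXY_SCRIPT,
        'wait': wait,
        'proxy_host': parts.hostname,      # host only, no scheme
        'proxy_port': parts.port,          # integer port
        'proxy_username': parts.username,  # None if the URL has no credentials
        'proxy_password': parts.password,
    }


# In the spider (requires scrapy-splash), something like:
#     yield SplashRequest(url, self.parse, endpoint='execute',
#         args=build_splash_args('http://username:secretpass@us-ny.proxymesh.com:31280'))
```

Passing the proxy details as arguments keeps credentials out of the Lua source and lets the same script be reused with different proxies.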
