How to bypass Cloudflare bot/DDoS protection in Scrapy?

Question

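I used to scrape e-commerce pages occasionally to collect product price information. I had not used my Scrapy-based crawler for a while, and when I tried it yesterday I ran into a bot protection problem.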

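The site uses CloudFlare's DDoS protection, which essentially relies on JavaScript evaluation to filter out browsers (and therefore crawlers) that have JS disabled. Once the function is evaluated, a response containing the computed number is generated. In return, the service sends back two authentication cookies which, attached to each request, allow the site to be crawled normally. Here is a description of how it works.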

I also found the cloudflare-scrape Python module, which uses an external JS evaluation engine to calculate the number and send the request back to the server. However, I'm not sure how to integrate it into Scrapy. Or maybe there is a smarter way that doesn't rely on JS execution? In the end, it's a form...

I would appreciate any help.

Answer 1

Obviously the best approach is to have your IP whitelisted in CloudFlare. If that is not an option, let me recommend the cloudflare-scrape library. You can use it to get the cookie token, then provide that cookie token in your Scrapy request back to the server.
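The hand-off described above (obtain the cookies, then attach them to the Scrapy request) can be sketched roughly as follows. This is only an illustration under stated assumptions: `clearance_request_kwargs` is a hypothetical helper, not part of Scrapy or cloudflare-scrape, and the cookie name and user agent in the example are placeholders.

```python
def clearance_request_kwargs(url, tokens, user_agent):
    """Bundle clearance cookies and the matching User-Agent for one request.

    Hypothetical helper: the returned dict uses only the `url`, `cookies`
    and `headers` keyword arguments that scrapy.Request accepts.
    """
    if not tokens:
        raise ValueError("no clearance cookies were obtained for %s" % url)
    return {
        "url": url,
        "cookies": dict(tokens),  # copy so later changes to `tokens` don't leak in
        "headers": {"User-Agent": user_agent},
    }


# Placeholder values standing in for what a token lookup would return.
example = clearance_request_kwargs(
    "https://example.com/",
    {"placeholder_cookie": "placeholder-value"},
    "Mozilla/5.0 (placeholder)",
)
```

Inside a spider this would presumably be used as `scrapy.Request(**clearance_request_kwargs(url, tokens, agent))`. The user agent is kept next to the cookies because the cloudflare-scrape documentation notes that the tokens only work with the user agent that obtained them.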

Answer 2

So I executed the JavaScript from Python with the help of cloudflare-scrape.

In your scraper, you need to add the following code:

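import cfscrape
from scrapy import Request

def start_requests(self):
  for url in self.start_urls:
    # Fetch the clearance cookies, then reuse the same User-Agent on the request.
    token, agent = cfscrape.get_tokens(url, 'Your preferred user agent, _optional_')
    yield Request(url=url, cookies=token, headers={'User-Agent': agent})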

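Add this alongside your parse function, and that's it!

Of course, you need to install cloudflare-scrape first and import it into your spider. You also need a JS execution engine installed. I already had Node.js, and it works without complaints.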
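Since the approach depends on an external JS engine, it may help to check for one before running the spider. The sketch below uses only the standard library. `find_js_engine` is a hypothetical helper, and the binary names `node` and `nodejs` are assumptions about how Node.js is installed on a given system. The library itself is typically installed with `pip install cfscrape`.

```python
import shutil


def find_js_engine(candidates=("node", "nodejs")):
    """Return the path of the first JS runtime found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


engine = find_js_engine()
print("JS engine:", engine or "not found; install Node.js first")
```

If this prints "not found", the challenge probably cannot be solved locally until a JS runtime is installed.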

Answer 3

If you are getting a 503 error, you can follow these steps:

Go to settings.py
Search for: USER_AGENT

There you will see Scrapy's default bot user agent. Replace the default value with:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
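If changing the user agent for the whole project is too broad, the same value can probably be scoped to a single spider. The fragment below is a sketch: `BROWSER_USER_AGENT` and `CUSTOM_SETTINGS` are illustrative names, and the dict is meant to be assigned to a spider's `custom_settings` class attribute, which Scrapy applies on top of settings.py.

```python
# Browser-like user agent copied from the answer above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/83.0.4103.116 Safari/537.36"
)

# Assign this dict to a spider's `custom_settings` class attribute.
CUSTOM_SETTINGS = {"USER_AGENT": BROWSER_USER_AGENT}
```

Note that a browser-like user agent alone may not get past a JavaScript challenge. It only avoids being rejected for advertising Scrapy's default bot identity.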
