Getting Forbidden by robots.txt: scrapy

Question

While crawling a website like https://www.netflix.com, I get: Forbidden by robots.txt: <GET https://www.netflix.com/>

ERROR: No response downloaded for: https://www.netflix.com/

Answer 1

First, make sure you change the user agent in your requests; otherwise the default user agent will almost certainly be blocked.
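As a minimal sketch of what changing the user agent could look like, the fragment below sets Scrapy's `USER_AGENT` setting in settings.py. The bot name and contact URL are hypothetical placeholders, not values from the question. Note that this only changes how the crawler identifies itself; it may not remove the "Forbidden by robots.txt" message on its own, since that message comes from the robots.txt rules.

```python
# settings.py (fragment)

# Hypothetical bot name, used here only for illustration.
BOT_NAME = "example_bot"

# USER_AGENT replaces Scrapy's default user agent string.
# The contact URL is a placeholder assumption; substitute your own.
USER_AGENT = f"{BOT_NAME} (+https://www.example.com/contact)"
```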

Answer 2

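In the new version (Scrapy 1.1), released on 2016-05-11, the crawl first downloads robots.txt before crawling anything else. To change this behavior, set ROBOTSTXT_OBEY in your settings.py: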

ROBOTSTXT_OBEY = False

See the release notes for details.
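To illustrate roughly what a robots.txt check amounts to, here is a minimal sketch using Python's standard-library `urllib.robotparser` rather than Scrapy's own middleware. The robots.txt content, the `my-crawler` user agent, the `is_allowed` helper, and the example.com URLs are all assumptions made up for illustration; they are not Netflix's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file of any site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/private/page",
        "https://www.example.com/public/page",
    ):
        print(url, is_allowed(EXAMPLE_ROBOTS_TXT, "my-crawler", url))
```

When ROBOTSTXT_OBEY is enabled, Scrapy performs a comparable check before each request and skips the ones the rules disallow, which is where the "Forbidden by robots.txt" log line comes from.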

Answer 3

Netflix's Terms of Use state:

You also agree not to circumvent, remove, alter, deactivate, degrade or thwart any of the content protections in the Netflix service; use any robot, spider, scraper or other automated means to access the Netflix service;

Their robots.txt is set up to block web scrapers. If you override the setting in settings.py with ROBOTSTXT_OBEY = False, you are violating their Terms of Use, which could result in a lawsuit.

python

web-crawler

scrapy