How to avoid getting banned when using Scrapy

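I keep getting banned by a website. I set download_delay = 10 in Scrapy and tried the fake_user_agent package. Then I tried setting up Tor and Polipo; according to this site, the configuration is fine. But after running the spider once or twice I got banned again! Can anyone help me?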

Note: I also wanted to try scrapy-proxie, but I can't get it to activate.

  1. Use a delay between hits
  2. Don't use Tor: all connections come from one address, which is bad. Rotate proxies after a few hits instead
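The "rotate proxies after a few hits" advice could be sketched as a downloader middleware like the one below. This is only an illustration, not an existing package: the class name and the PROXY_LIST and PROXY_HITS settings are hypothetical. The one Scrapy behaviour it relies on is that the built-in HttpProxyMiddleware uses the proxy URL found in request.meta["proxy"].

```python
import itertools


class RotatingProxyMiddleware:
    """Sketch: send each request through a proxy, switching every N requests."""

    def __init__(self, proxies, hits_per_proxy=5):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._cycle = itertools.cycle(proxies)  # endless round-robin over the pool
        self._hits_per_proxy = hits_per_proxy
        self._hits = 0
        self._current = next(self._cycle)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST and PROXY_HITS are hypothetical custom settings,
        # not built-in Scrapy settings.
        return cls(
            crawler.settings.getlist("PROXY_LIST"),
            crawler.settings.getint("PROXY_HITS", 5),
        )

    def process_request(self, request, spider):
        # Move to the next proxy once the current one has served its quota.
        if self._hits >= self._hits_per_proxy:
            self._current = next(self._cycle)
            self._hits = 0
        self._hits += 1
        request.meta["proxy"] = self._current
        return None  # let Scrapy continue processing the request
```

To try it, the class would be listed in DOWNLOADER_MIDDLEWARES with an order below 750, so that it runs before the built-in HttpProxyMiddleware. The proxies themselves have to come from somewhere you control or pay for.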

Also check this post: web scraping etiquette

You should take a look at what the documentation says.

Here are some tips to keep in mind when dealing with these kinds of sites:

  • rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)

  • disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour

  • use download delays (2 seconds or higher). See DOWNLOAD_DELAY setting.

  • if possible, use Google cache to fetch pages, instead of hitting the sites directly

  • use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh

  • use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
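As a rough illustration of the first few tips, here is a minimal sketch of a user-agent rotating middleware together with the two settings mentioned above. The class name, the USER_AGENT_POOL setting, the myproject.middlewares path and the user-agent strings are all assumptions made for the example; only COOKIES_ENABLED, DOWNLOAD_DELAY and DOWNLOADER_MIDDLEWARES are actual Scrapy settings.

```python
import random

# Example user-agent strings, for illustration only; replace them with a
# current list collected from real browsers.
DEFAULT_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]


class RandomUserAgentMiddleware:
    """Sketch: pick a random User-Agent header for every outgoing request."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("at least one user agent is required")
        self._user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_POOL is a hypothetical custom setting; fall back to the
        # example list above when it is not defined.
        pool = crawler.settings.getlist("USER_AGENT_POOL") or DEFAULT_USER_AGENTS
        return cls(pool)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self._user_agents)
        return None  # let Scrapy continue processing the request


# Could be copied into settings.py or used as a spider's custom_settings.
ANTI_BAN_SETTINGS = {
    "COOKIES_ENABLED": False,  # some sites use cookies to spot bot behaviour
    "DOWNLOAD_DELAY": 2,       # seconds between requests to the same site
    "DOWNLOADER_MIDDLEWARES": {
        # hypothetical module path for the class above
        "myproject.middlewares.RandomUserAgentMiddleware": 400,
    },
}
```

Scrapy also randomizes the delay by default (RANDOMIZE_DOWNLOAD_DELAY), waiting between 0.5 and 1.5 times DOWNLOAD_DELAY, so the requests are not evenly spaced.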