How to avoid being banned when using scrapy
I keep getting banned by websites. I set download_delay = 10 in scrapy, I tried the fake_user_agent package, then I tried implementing tor and polipo; according to this site the configuration is fine. But after running it once or twice I get banned again! Can anyone help me?
Note: I also wanted to try scrapy-proxie, but I couldn't get it to activate.
- Use delays between hits
- Don't use tor - all connections come from a single address - that's the problem; rotate proxies after several requests instead
Also check this post - web scraping etiquette
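The proxy-rotation advice can be sketched as a downloader-middleware-style class. This is a minimal stand-alone sketch (a plain dict stands in for a scrapy Request object, and the proxy addresses are made-up placeholders); in a real project you would implement `process_request` on a middleware registered in `DOWNLOADER_MIDDLEWARES`:

```python
import random

class RotatingProxyMiddleware:
    """Sketch of a middleware that picks a different proxy per request.

    The proxy addresses used below are placeholders, not real servers.
    """

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider=None):
        # Scrapy routes a request through whatever proxy is set in
        # request.meta['proxy']; here "request" is just a dict sketch.
        request["meta"]["proxy"] = random.choice(self.proxies)


# Usage: each request gets one proxy from the pool at random.
middleware = RotatingProxyMiddleware(
    ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
)
req = {"url": "http://example.com", "meta": {}}
middleware.process_request(req)
```

This way consecutive requests no longer all originate from one address, which is the pattern that gets a single tor exit node banned.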
You should look at what the documentation says.
Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher). See DOWNLOAD_DELAY setting.
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
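As a rough illustration, the first few tips map onto scrapy settings like this. COOKIES_ENABLED, DOWNLOAD_DELAY and USER_AGENT are real scrapy setting names; the user-agent strings are just example values, and picking one at random at startup is a simplification of true per-request rotation:

```python
import random

# Pool of well-known browser user agents (example strings only).
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7; rv:115.0) Firefox/115.0",
]

# Scrapy settings mirroring the quoted tips.
SETTINGS = {
    "COOKIES_ENABLED": False,               # sites may use cookies to spot bots
    "DOWNLOAD_DELAY": 2,                    # seconds between requests (2 or higher)
    "USER_AGENT": random.choice(USER_AGENT_POOL),
}
```

In an actual project these would live as top-level names in settings.py rather than in a dict.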