How to prevent getting blacklisted while scraping Amazon

Question

I am trying to scrape Amazon with Scrapy, but I get this error:

DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> 
(failed 1 times): 503 Service Unavailable

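I think this is because Amazon is very good at detecting bots. How can I prevent this?

I call time.sleep(6) before every request.

I don't want to use their API.

I have also tried using Tor and polipo.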

Answer 1

You have to be very careful with Amazon and follow the Amazon Terms of Use and its policies related to web scraping.

Amazon is quite good at banning the IPs of bots. You will have to tune DOWNLOAD_DELAY and CONCURRENT_REQUESTS to hit the website less often and be a good web-scraping citizen. You will also need to rotate IP addresses (you may look into, for instance, crawlera) and user agents.
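As a rough illustration of the throttling and user-agent rotation described above, here is a minimal sketch. The numeric values, the user-agent strings, the class name RandomUserAgentMiddleware and the module path myproject.middlewares are all assumptions made for the example, not values known to work against Amazon. Note also that Scrapy schedules the delay itself through DOWNLOAD_DELAY, whereas time.sleep() inside a spider blocks the whole event loop. IP rotation is not shown, since it depends on the proxy service you choose.

```python
import random

# --- settings.py fragment (illustrative values, tune for your own case) ---
DOWNLOAD_DELAY = 6                  # base delay in seconds between requests to a site
RANDOMIZE_DOWNLOAD_DELAY = True     # Scrapy waits 0.5x to 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 1             # global cap on parallel requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # cap per domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency

# Hypothetical module path; point it at wherever the class below lives.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
}

# --- middlewares.py fragment ---
# Example user-agent strings only; maintain your own, realistic pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/115.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the header before the request is sent.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to continue processing the request normally.
        return None
```

The priority 400 is chosen so that this middleware runs before Scrapy's built-in UserAgentMiddleware, which only sets the header when it is missing.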


amazon

web-crawler

scrapy

web-scraping

scrapy-spider