Looking for a way to avoid getting banned while crawling
I make a lot of requests in Python to the page https://www.instagram.com/explore/tags/some_hashtag/?__a=1. Here is the code:
import random
import requests

def LoadUserAgents(uafile):
    """
    uafile : string
        path to text file of user agents, one per line
    """
    uas = []
    # Binary mode: each user agent is read as a bytes object
    with open(uafile, 'rb') as uaf:
        for ua in uaf.readlines():
            if ua:
                uas.append(ua.strip())
    random.shuffle(uas)
    return uas

# hashtag and proxy are set elsewhere in the crawler (not shown)
address = f'https://www.instagram.com/explore/tags/{hashtag[1:]}/?__a=1'
uas = LoadUserAgents("user-agents.txt")
ua = random.choice(uas)
headers = {
    "Connection": "close",
    "User-Agent": ua}
r = requests.get(address, proxies=proxy, timeout=30, headers=headers)
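The text file 'user-agents.txt' comes from here
An example of the variable proxy is proxy={'http': 'http://104.196.45.252:80'}
Even so, I can see in the logs that I often get banned for a short time:
{'message': 'Please wait a few minutes before you try again.', 'status': 'fail'}
After such a ban I immediately switch the proxy and the user agent, but the following requests show that I am still banned: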
[Crawler @ 17_07_2018_15h29m34s]
Error message:{'message': 'Please wait a few minutes before you try again.', 'status': 'fail'}
Proxy:{'http': 'http://104.196.45.252:80'}
Header: {'Connection': 'close', 'User-Agent': b'Mozilla/5.0 (Windows; U; Windows NT 5.0; fr; rv:1.8.1.9pre) Gecko/20071102 Firefox/2.0.0.9 Navigator/9.0.0.3'}
[Crawler @ 17_07_2018_15h29m44s]
Error message: {'message': 'Please wait a few minutes before you try again.', 'status': 'fail'}
Proxy:{'http': 'http://52.77.242.220:80'}
Header: {'Connection': 'close', 'User-Agent': b'Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.7.3) Gecko/20040910'}
....
Any idea what I am doing wrong, or what I should add to avoid this?
Thanks!
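Try providing a proxy for https traffic. The proxy you are supplying is currently not used at all: the URL you request is https, and your proxies dict only has an 'http' key, so every request goes out directly from your own IP.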
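As a rough sketch of that suggestion, the snippet below builds a proxies dict that covers both schemes. The helper names proxies_for and scheme_has_proxy are made up for illustration, and the check is a simplified stand-in for how requests picks a proxy by URL scheme (requests also honours other key forms such as 'all'). It assumes the proxy in question can actually tunnel HTTPS traffic, which is not guaranteed for the address from the question.

```python
from urllib.parse import urlsplit


def proxies_for(proxy_url):
    """Hypothetical helper: map both schemes to the same proxy
    so that https:// URLs are routed through it as well."""
    return {"http": proxy_url, "https": proxy_url}


def scheme_has_proxy(url, proxies):
    """Simplified illustration: is there a proxies entry
    whose key matches the scheme of this URL?"""
    return urlsplit(url).scheme in proxies


address = "https://www.instagram.com/explore/tags/some_hashtag/?__a=1"

# The dict from the question only covers plain http URLs
old_proxy = {"http": "http://104.196.45.252:80"}
# Assumption: the same proxy is able to handle https traffic
new_proxy = proxies_for("http://104.196.45.252:80")

print(scheme_has_proxy(address, old_proxy))  # False
print(scheme_has_proxy(address, new_proxy))  # True

# The request itself stays the same apart from the dict:
# r = requests.get(address, proxies=new_proxy, timeout=30, headers=headers)
```

With the original dict the https request has no matching entry, which would explain why changing the proxy made no difference to the ban.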