TooManyRedirects when using Google's IP instead of the domain name
I am trying to crawl Google search results. Everything works fine when I use the domain name, like this:
import requests
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
requests.get('https://google.com/search?q={}'.format('movie'),\
verify=False, headers={'User-Agent': user_agent})
But when I crawl Google using its IP address:
requests.get('https://216.58.207.78/search?q={}'.format('movie'),\
verify=False, headers={'User-Agent': user_agent, 'host': 'google.com'})
I get the following error:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 668, in send
history = [resp for resp in gen] if allow_redirects else []
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 668, in <listcomp>
history = [resp for resp in gen] if allow_redirects else []
File "/home/mohammad/myfiles/gitRepo/telesearch/env/lib/python3.6/site-packages/requests/sessions.py", line 165, in resolve_redirects
raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
How can I fix this?
Fix it by adding www. to your Host header:
requests.get('https://216.58.207.78/search?q={}'.format('movie'),\
verify=False, headers={'User-Agent': user_agent, 'host': 'www.google.com'})
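Explanation:
This happens because you put google.com in the Host HTTP header. When Google receives your request, it sees that the Host header asks for google.com, so it redirects you to www.google.com. But when requests follows the redirect, it sends the same headers you originally supplied, so Host still contains google.com. The server therefore redirects you again, and so on, until requests gives up after 30 redirects.
You could also just remove the Host header; as far as I know, it makes no difference.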
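The loop described above can be reproduced offline with a small simulation. This is only a sketch under stated assumptions, not how Google's servers actually work: the canonical name `www.example.test`, the handler `CanonicalHostHandler`, and the helpers `start_demo_server` and `follow_with_fixed_host` are all hypothetical, and the redirect uses a relative `Location` so that everything stays on the local machine. It uses only the standard library rather than requests, and the client re-sends the same pinned Host header on every hop, which is the behaviour the explanation attributes to the original call.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical canonical host name, standing in for www.google.com.
CANONICAL_HOST = "www.example.test"


class CanonicalHostHandler(BaseHTTPRequestHandler):
    """Redirects every request whose Host header is not the canonical name."""

    def do_GET(self):
        if self.headers.get("Host") != CANONICAL_HOST:
            # Simplification: redirect to the same path on this server.
            self.send_response(301)
            self.send_header("Location", self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_demo_server():
    """Start the demo server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), CanonicalHostHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def follow_with_fixed_host(port, host_header, max_redirects=5):
    """Follow redirects while always sending the same Host header.

    Returns the number of redirects followed before a non-redirect
    response, or None if the limit was exceeded.
    """
    path = "/search?q=movie"
    for hops in range(max_redirects + 1):
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", path, headers={"Host": host_header})
        resp = conn.getresponse()
        resp.read()
        conn.close()
        if resp.status != 301:
            return hops
        path = resp.getheader("Location")
    return None


if __name__ == "__main__":
    demo = start_demo_server()
    demo_port = demo.server_address[1]
    # Non-canonical Host: every hop is redirected again.
    print(follow_with_fixed_host(demo_port, "example.test"))
    # Canonical Host: answered directly, no redirects.
    print(follow_with_fixed_host(demo_port, CANONICAL_HOST))
    demo.shutdown()
```

With the real request, passing `allow_redirects=False` to `requests.get` and looking at the `Location` header of the response is one way to check where the server is trying to send you.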