Python: Issues with httplib & requests; https seems to cause a redirect then BadStatusLine exception
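I am currently trying to use BeautifulSoup to scrape some information from the discogs website that is not available through their API. Unfortunately, I can't seem to connect to the site via urllib2, httplib, or requests without running into a BadStatusLine exception.
I think this is because any request to http://www.discogs.com is redirected to https://www.discogs.com. I was able to confirm the redirect with the following code: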
import requests

r_link = "http://www.discogs.com"
print("Trying " + r_link)
# allow_redirects=False returns the 301 itself instead of following it
r = requests.get(r_link, allow_redirects=False)
print(r.status_code, r.reason, r.history, r.headers['Location'])
This returns:
Trying http://www.discogs.com
(301, 'Moved Permanently', [], 'https://www.discogs.com/')
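If I understand correctly, this means that any request to http://www.discogs.com will be redirected to https://www.discogs.com. One would therefore think the obvious solution is to make the request directly to https://www.discogs.com. Unfortunately, using the code above (i.e. adding the s to the r_link URL) results in a BadStatusLine error: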
Trying https://www.discogs.com
Traceback (most recent call last):
  File "start.py", line 26, in <module>
    r = requests.get(r_link, allow_redirects=False)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))
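According to the examples in the requests documentation, I should be able to handle https links without any problem. Indeed, trying the code above with https://www.google.com gives a 302 response, and the redirect succeeds when I use the URL in r.headers['Location'].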
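So what is the problem? Why is this happening? Is it because I have made a mistake? Could it be something specific to my device or setup? Is it something specific to the discogs server? I don't know how to diagnose this.
Thanks.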
Add a User-Agent header and the request will work:
import requests

# A browser-like User-Agent; the default one sent by requests is what fails here
h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
r_link = "https://www.discogs.com"
print("Trying " + r_link)
r = requests.get(r_link, headers=h)
print(r.status_code, r.reason, r.history, r.headers)
print(r.content)
A working example:
In [19]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
In [20]: r_link = "https://www.discogs.com"
In [21]: r = requests.get(r_link, headers=h)
In [22]: print(r.status_code, r.reason, r.history, r.headers)
(200, 'OK', [], {'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'sid=fad997b268420522ac0242de41fc694c; Domain=www.discogs.com; Expires=Sun, 19-Apr-2026 17:04:09 GMT; Path=/, language2=en; Domain=www.discogs.com; Path=/, session="9H1LFLTWiCMSowA7nKbUYlHU4N8=?"; Domain=www.discogs.com; Secure; HttpOnly; Path=/', 'Server': 'nginx/1.8.1', 'Connection': 'keep-alive', 'Date': 'Thu, 21 Apr 2016 17:04:10 GMT', 'Content-Type': 'text/html; charset=utf-8'})
In [23]: from bs4 import BeautifulSoup
In [24]: soup.select("#email")
Out[24]: [<input autocaptialize="off" autocomplete="off" id="email" name="email" placeholder="Enter your email address" type="text"/>]
In [25]: soup.select("#username")
Out[25]: [<input autocaptialize="off" autocomplete="off" id="username" name="username" placeholder="Choose a username" type="text"/>]
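Why the header matters is not stated above. BadStatusLine("''") means httplib read an empty status line, which is what a connection closed without any response looks like. One plausible reading, and it is only an assumption about the discogs server, is that it hangs up on clients whose User-Agent does not look like a browser. The sketch below reproduces that behaviour locally with the Python 3 standard library (http.client in place of requests); PickyHandler, fetch, and the "Mozilla/" prefix rule are hypothetical, made up for the demo.

```python
import http.client
import socketserver
import threading

BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


class PickyHandler(socketserver.StreamRequestHandler):
    """Hypothetical server: answers browser-like clients, hangs up on others."""

    def handle(self):
        self.rfile.readline()  # request line, e.g. "GET / HTTP/1.1"
        headers = {}
        # Read header lines until the blank line that ends the request head.
        for raw in iter(self.rfile.readline, b""):
            line = raw.decode("latin-1").strip()
            if not line:
                break
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()

        if not headers.get("user-agent", "").startswith("Mozilla/"):
            return  # close the socket without writing a status line

        self.wfile.write(
            b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
        )


def fetch(port, headers):
    """Return the status code, or the exception name if no status line arrives."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/", headers=headers)
        return conn.getresponse().status
    except http.client.BadStatusLine as exc:
        return type(exc).__name__
    finally:
        conn.close()


server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), PickyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

without_ua = fetch(port, {})  # http.client adds no User-Agent on its own
with_ua = fetch(port, {"User-Agent": BROWSER_UA})

server.shutdown()
server.server_close()
print(without_ua, with_ua)
```

On Python 3 the failure surfaces as http.client.RemoteDisconnected, a subclass of BadStatusLine, so the first call reports an exception name while the second returns 200.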
If you want to log in:
import requests

h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
login = "https://www.discogs.com/login?return_to=%2F"
# The session keeps the cookies set by the login response
with requests.session() as s:
    r = s.post(login, data={"username":"your_user","password":"your_pass","Action.Login":""}, headers=h)
    print(r.content)
If we run it, you can see we end up at https://www.discogs.com/my:
In [27]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
In [28]: login = "https://www.discogs.com/login?return_to=%2F"
In [29]: with requests.session() as s:
   ....:     r = s.post(login, data={"username":"xxxxxxxx","password":"xxxxxxxx","Action.Login":""}, headers=h)
   ....:     print(r.url)
   ....:
https://www.discogs.com/my
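If more requests follow the login, it can be less error-prone to attach the User-Agent once instead of passing headers=h on every call. With requests that is s.headers.update(h) on the session. The sketch below shows the same idea using only the Python 3 standard library against a throwaway local server, so it is not the discogs login flow; EchoHandler, make_opener, and the sid=abc123 cookie are hypothetical.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: sets a cookie and echoes what the client sent."""

    def do_GET(self):
        body = "{}|{}".format(
            self.headers.get("User-Agent", ""), self.headers.get("Cookie", "")
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Set-Cookie", "sid=abc123; Path=/")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def make_opener(user_agent):
    """Build an opener that keeps cookies and sends user_agent on every request."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore proxy settings for the local demo
        urllib.request.HTTPCookieProcessor(jar),
    )
    opener.addheaders = [("User-Agent", user_agent)]  # replaces the Python-urllib default
    return opener


server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:{}/".format(server.server_port)

opener = make_opener(BROWSER_UA)
first = opener.open(url, timeout=5).read().decode("utf-8")
second = opener.open(url, timeout=5).read().decode("utf-8")

server.shutdown()
server.server_close()
print(first)
print(second)
```

The first response should show the User-Agent with no cookie; the second should also show the cookie the server set, which is the same persistence requests.session() provides in the login example.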