Python: Issues with httplib & requests; https seems to cause a redirect then BadStatusLine exception

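I'm currently trying to use BeautifulSoup to scrape some information from the discogs website that isn't available through their API. Unfortunately, I can't seem to connect to the site via urllib2, httplib, or requests without running into a BadStatusLine exception.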

I believe this is because any request to http://www.discogs.com is redirected to https://www.discogs.com. I was able to confirm the redirect with the following code:

import requests

r_link = "http://www.discogs.com"
print("Trying " + r_link)
# Disable redirect following so the 301 response itself can be inspected
r = requests.get(r_link, allow_redirects=False)
print(r.status_code, r.reason, r.history, r.headers['Location'])

This returns:

Trying http://www.discogs.com
(301, 'Moved Permanently', [], 'https://www.discogs.com/')

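If I understand correctly, this means that any request to http://www.discogs.com will be redirected to https://www.discogs.com. So one would think the obvious solution is to make the request directly to https://www.discogs.com. Unfortunately, using the code above (i.e. adding an s to the scheme in r_link) results in a BadStatusLine error: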

Trying https://www.discogs.com
Traceback (most recent call last):
  File "start.py", line 26, in <module>
    r = requests.get(r_link, allow_redirects=False)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))

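According to the examples in the requests documentation, I should be able to handle https links without any problem. Indeed, trying the code above with https://www.google.com yields a 302 response, and the redirect succeeds when I use the URL in r.headers['Location'].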

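So what is the problem? Why is this happening? Is it because I made a mistake? Could it be something specific to my device or setup? Is it something specific to the discogs servers? I don't know how to diagnose this.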

Thanks.

Add a user agent and the request will work:

import requests

# Send a browser-like User-Agent instead of the python-requests default
h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}
r_link = "https://www.discogs.com"
print("Trying " + r_link)
r = requests.get(r_link, headers=h)
print(r.status_code, r.reason, r.history, r.headers)
print(r.content)

A working example:

In [19]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}


In [20]: r_link = "https://www.discogs.com"

In [21]: r = requests.get(r_link, headers=h)

In [22]: print(r.status_code, r.reason, r.history, r.headers)
(200, 'OK', [], {'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'sid=fad997b268420522ac0242de41fc694c; Domain=www.discogs.com; Expires=Sun, 19-Apr-2026 17:04:09 GMT; Path=/, language2=en; Domain=www.discogs.com; Path=/, session="9H1LFLTWiCMSowA7nKbUYlHU4N8=?"; Domain=www.discogs.com; Secure; HttpOnly; Path=/', 'Server': 'nginx/1.8.1', 'Connection': 'keep-alive', 'Date': 'Thu, 21 Apr 2016 17:04:10 GMT', 'Content-Type': 'text/html; charset=utf-8'})

In [23]: from bs4 import BeautifulSoup; soup = BeautifulSoup(r.content, "html.parser")

In [24]: soup.select("#email")
Out[24]: [<input autocaptialize="off" autocomplete="off" id="email" name="email" placeholder="Enter your email address" type="text"/>]

In [25]: soup.select("#username")
Out[25]: [<input autocaptialize="off" autocomplete="off" id="username" name="username" placeholder="Choose a username" type="text"/>]
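
The traceback in the question ends with BadStatusLine("''"), which means the connection was closed before any HTTP status line arrived. One plausible explanation, and it is only an assumption since nothing here confirms how the discogs servers behave, is that the server drops requests whose User-Agent does not look like a browser. The Python 3 sketch below reproduces that situation locally with the standard library only. The names `picky_server` and `fetch` and the "contains Mozilla" rule are hypothetical and exist purely for illustration.

```python
import http.client
import socket
import threading


def picky_server(listener, n_connections):
    """Toy server: reply only when the request mentions a Mozilla user agent."""
    for _ in range(n_connections):
        conn, _addr = listener.accept()
        with conn:
            data = b""
            # Read until the end of the request headers
            while b"\r\n\r\n" not in data:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            if b"mozilla" in data.lower():
                conn.sendall(
                    b"HTTP/1.1 200 OK\r\n"
                    b"Content-Length: 2\r\n"
                    b"Connection: close\r\n\r\nok"
                )
            # Otherwise the socket is closed without sending anything back


def fetch(port, headers):
    """Return (status, body), or ("dropped", exception name) if no reply came."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/", headers=headers)
        resp = conn.getresponse()
        return resp.status, resp.read()
    except http.client.BadStatusLine as exc:
        # In Python 3 this is usually the RemoteDisconnected subclass
        return "dropped", type(exc).__name__
    finally:
        conn.close()


listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(2)
port = listener.getsockname()[1]
worker = threading.Thread(target=picky_server, args=(listener, 2), daemon=True)
worker.start()

dropped = fetch(port, {"User-Agent": "python-requests/2.9"})
served = fetch(port, {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
worker.join(timeout=5)
listener.close()

print(dropped)
print(served)
```

The first call fails the same way the original request did, while the second gets a normal response, which matches the behaviour seen once the browser-like header is added.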

If you want to log in:

import requests

h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

login = "https://www.discogs.com/login?return_to=%2F"
# A session keeps the login cookies for any later requests
with requests.session() as s:
    r = s.post(login, data={"username":"your_user","password":"your_pass","Action.Login":""}, headers=h)
    print(r.content)

If we run it, you can see we end up at https://www.discogs.com/my:

In [27]: h = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

In [28]: login = "https://www.discogs.com/login?return_to=%2F"

In [29]: with requests.session() as s:
   ....:         r = s.post(login, data={"username":"xxxxxxxx","password":"xxxxxxxx","Action.Login":""}, headers=h)
   ....:         print(r.url)
   ....:     
https://www.discogs.com/my
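
Since every request to the site needs the header, it can be convenient to set it once on a session rather than passing `headers=h` each time. Here is a minimal sketch of that idea. The helper names `make_session` and `fetch`, the `BROWSER_UA` constant, and the 10 second timeout are my own choices for illustration, not anything the site requires.

```python
import requests

# Same browser-like User-Agent string as used above
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36")


def make_session(user_agent=BROWSER_UA):
    """Hypothetical helper: a session that sends user_agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch(session, url, timeout=10):
    """Hypothetical helper: GET url, returning None instead of raising on failure."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response
    except requests.exceptions.RequestException as exc:
        # ConnectionError (which wraps BadStatusLine) is a RequestException
        print("Request to %s failed: %r" % (url, exc))
        return None


session = make_session()
# Build the request without sending it, to check which header would go out
prepared = session.prepare_request(
    requests.Request("GET", "https://www.discogs.com")
)
print(prepared.headers["User-Agent"])
```

Calling `fetch(session, "https://www.discogs.com")` would then perform the actual request with the header already in place, and the same session can be reused for the login post.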