"SSL: certificate_verify_failed" 抓取 https://www.thenewboston.com/ 时出错

"SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/

So I recently started learning Python using "The New Boston's" videos on YouTube, and everything was going great until I got to his tutorial on making a simple web crawler. While I understood it with no problem, when I run the code I get errors that all seem to be based on "SSL: CERTIFICATE_VERIFY_FAILED." I've been searching for an answer since last night trying to figure out how to fix it. It seems no one else in the comments on the video or on his website is having the same problem as me, and even using someone else's code from his website I get the same result. I'll post the code I got from the website, since it gives me the same error, and the code I wrote is a total mess right now.

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page) #this is page of popular posts
        source_code = requests.get(url)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easy
        for link in soup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
    page += 1
trade_spider(1)

The full error is: ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)

Apologies if this is a dumb question, I'm still new to programming, but I genuinely can't figure this out. I was considering just skipping this tutorial, but not being able to fix this really bothers me. Thanks!

I'm posting this as an answer because I've gotten past your original problem, but there's still an issue in your code (I can update this once it's fixed).

Long story short: you may be using an old version of requests, or the SSL certificate may be invalid. There's more information in this SO question: Python requests "certificate verify failed"
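
To quickly check which versions are in play, here is a minimal sketch; run it in the same interpreter that raises the error:

import ssl
import requests

# An outdated requests or an old OpenSSL build is a common cause of
# CERTIFICATE_VERIFY_FAILED errors, so print both versions first.
print(requests.__version__)
print(ssl.OPENSSL_VERSION)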

I've updated the code into my own bsoup.py file:

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page) #this is page of popular posts
        source_code = requests.get(url, timeout=5, verify=False)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easy
        for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)

        page += 1

if __name__ == "__main__":
    trade_spider(1)

When I run the script, it gives me this error:

https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1
Traceback (most recent call last):
  File "./bsoup.py", line 26, in <module>
    trade_spider(1)
  File "./bsoup.py", line 16, in trade_spider
    for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}): #all links, which contains "" class='index_singleListingTitles' "" in it.
  File "/usr/local/lib/python3.4/dist-packages/bs4/element.py", line 1256, in find_all
    generator = self.descendants
AttributeError: 'str' object has no attribute 'descendants'

You have a problem with your findAll method. I used both python3 and python2, where python2 reports:

TypeError: unbound method find_all() must be called with BeautifulSoup instance as first argument (got str instance instead)

It seems you'll need to fix that method before you can continue.
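
For reference, a minimal sketch of the likely fix, judging from the traceback: findAll must be called on a BeautifulSoup instance parsed from the downloaded HTML (plain_text in the script above), not on the BeautifulSoup class itself:

from bs4 import BeautifulSoup

# Parse the downloaded HTML into a BeautifulSoup instance first ...
soup = BeautifulSoup(plain_text, "html.parser")
# ... then call findAll on that instance, not on the class
for link in soup.findAll('a', {'class': 'index_singleListingTitles'}):
    print(link.get('href'))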

You can tell requests not to verify the SSL certificate:

>>> url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1"
>>> response = requests.get(url, verify=False)
>>> response.status_code
200

See more in the requests docs.
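
Note that with verify=False, recent requests versions emit an InsecureRequestWarning on every call. If you knowingly accept the risk (this disables protection against man-in-the-middle attacks, so it's only appropriate for testing), the warning can be silenced; a sketch, assuming a modern requests that uses urllib3 underneath:

import urllib3
import requests

# Silence the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get("https://www.thenewboston.com/", verify=False)
print(response.status_code)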

The problem is not in your code but in the web site you are trying to access. When looking at the analysis by SSLLabs you will note:

This server's certificate chain is incomplete. Grade capped to B.

This means that the server configuration is wrong, and that not only Python but several other clients will have problems with this site. Some desktop browsers work around this configuration problem by trying to load the missing certificates from the internet or by filling in cached certificates. But other browsers or applications will fail too, similar to Python.

To work around the broken server configuration you might explicitly extract the missing certificates and add them to your trust store. Or you might give the certificate as trust via the verify argument. From the documentation:

You can pass verify the path to a CA_BUNDLE file or directory with certificates of trusted CAs:

>>> requests.get('https://github.com', verify='/path/to/certfile') 

This list of trusted CAs can also be specified through the REQUESTS_CA_BUNDLE environment variable.
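
A sketch of both approaches (the bundle path is hypothetical):

import os
import requests

# Option 1: pass the CA bundle explicitly for a single call
res = requests.get("https://www.thenewboston.com/", verify="/path/to/ca-bundle.pem")

# Option 2: set the environment variable so every requests call picks it up
os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/ca-bundle.pem"
res = requests.get("https://www.thenewboston.com/")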

Alternatively, you might be missing the stock certificates on your system. E.g. if running on Ubuntu, check that the ca-certificates package is installed.
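
On Ubuntu you can check, and if needed reinstall, the package with the standard apt tooling:

dpkg -s ca-certificates
sudo apt-get install --reinstall ca-certificates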

If you used the Python dmg installer on macOS, you also have to read the Python 3 ReadMe and run the bash command there to get new certificates.

Try running:

/Applications/Python\ 3.6/Install\ Certificates.command
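
That command essentially upgrades the certifi package and points the Python.org installer's SSL setup at certifi's CA bundle. You can see which bundle requests uses by default via certifi (a small sketch; certifi is installed together with requests):

import certifi
import requests

# certifi.where() returns the path of the CA bundle requests verifies against
print(certifi.where())

# Passing it explicitly is equivalent to the default behavior
res = requests.get("https://www.thenewboston.com/", verify=certifi.where())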

I spent hours trying to fix some Python and update certificates on a VM. In my case I was working with a server that someone else had set up. It turned out that the wrong certificate had been uploaded to the server. I found this command on another SO answer.

root@ubuntu:~/cloud-tools# openssl s_client -connect abc.def.com:443
CONNECTED(00000005)
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
0 s:OU = Domain Control Validated, CN = abc.def.com
   i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2

It's worth saying a bit more in a "hands-on" way about what's going on here, in addition to @Steffen Ullrich's (very detailed) answers here and elsewhere.

Notes:

  • I'll use a different web site than the OP's, because the OP's web site currently has no issues.
  • I used Ubuntu to run the commands below (curl and openssl). I tried running curl on Windows 10, but got different, unhelpful output.

The error the OP got can be "reproduced" by using the following curl command:

curl -vvI https://www.vimmi.net

Output (note the last line):

* TCP_NODELAY set
* Connected to www.vimmi.net (82.80.192.7) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (OUT), TLS alert, Server hello (2):
* SSL certificate problem: unable to get local issuer certificate
* stopped the pause stream!
* Closing connection 0
curl: (60) SSL certificate problem: unable to get local issuer certificate

Now let's run it with the --insecure flag, which will display the problematic certificate:

curl --insecure -vvI https://www.vimmi.net

Output (note the last two lines):

* Rebuilt URL to: https://www.vimmi.net/
*   Trying 82.80.192.7...
* TCP_NODELAY set
* Connected to www.vimmi.net (82.80.192.7) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* [...]
* Server certificate:
*  subject: OU=Domain Control Validated; CN=vimmi.net
*  start date: Aug  5 15:43:45 2019 GMT
*  expire date: Oct  4 16:16:12 2020 GMT
*  issuer: C=US; ST=Arizona; L=Scottsdale; O=GoDaddy.com, Inc.; OU=http://certs.godaddy.com/repository/; CN=Go Daddy Secure Certificate Authority - G2
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.

The same result can be seen using openssl, which is worth mentioning because it's used internally by Python:

echo | openssl s_client -connect vimmi.net:443

Output:

CONNECTED(00000005)
depth=0 OU = Domain Control Validated, CN = vimmi.net
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 OU = Domain Control Validated, CN = vimmi.net
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:OU = Domain Control Validated, CN = vimmi.net
   i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2
---
Server certificate
-----BEGIN CERTIFICATE-----
[...]
-----END CERTIFICATE-----
[...]
---
DONE

So why couldn't curl and openssl verify the certificate GoDaddy issued for this web site?

Well, to "verify a certificate" (to use openssl's error-message terminology) means to verify that the certificate contains a trusted source's signature (put differently: the certificate was signed by a trusted source), thereby verifying vimmi.net's identity ("identity" here strictly means that "the public key contained in the certificate belongs to the individual, organization, server or other entity noted in the certificate").

A source is "trusted" if we can establish its "chain of trust", with the following properties (a hands-on openssl check follows the list):

  1. The Issuer of each certificate (except the last one) matches the Subject of the next certificate in the list
  2. Each certificate (except the last one) is signed by the secret key corresponding to the next certificate in the chain (i.e. the signature of one certificate can be verified using the public key contained in the following certificate)
  3. The last certificate in the list is a trust anchor: a certificate that you trust because it was delivered to you by some trustworthy procedure
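
To check these properties by hand, openssl can verify a chain given the certificates as files; a sketch with hypothetical file names (leaf.pem is the server certificate, intermediate.pem its issuer's certificate, root.pem the trust anchor):

openssl verify -CAfile root.pem -untrusted intermediate.pem leaf.pem

On success it prints "leaf.pem: OK", meaning the chain of trust was established.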

In our case, the issuer is the "Go Daddy Secure Certificate Authority - G2". That is, an entity named "Go Daddy Secure Certificate Authority - G2" signed the certificate, so it's supposed to be a trusted source.

To establish the trustworthiness of that entity, we have two options:

  1. Assume that "Go Daddy Secure Certificate Authority - G2" is a "trust anchor" (see item 3 in the list above). Well, it turns out that curl and openssl attempt to act upon this assumption: they searched for that entity's certificate on the default paths (called CA paths), which are:

    • for curl, it's /etc/ssl/certs
    • for openssl, it's /usr/lib/ssl (run openssl version -a to see it).

But that certificate wasn't found, leaving us with our second option:

  2. Follow steps 1 and 2 listed above: to do that, we need to get the certificate issued for that entity. This can be achieved by downloading it from its source, or by using the browser.
    • For example, go to vimmi.net using Chrome, click the padlock > "Certificate" > "Certification Path" tab, select the entity > "View Certificate", then in the opened window go to the "Details" tab > "Copy to File" > Base-64 encoded > save the file.

Great! Now that we have that certificate (which can be in any file format: cer, pem, etc.; you can even save it as a txt file), let's tell curl to use it:

curl --cacert test.cer https://vimmi.net

Back to Python

Once we have:

  1. The "Go Daddy Secure Certificate Authority - G2" certificate
  2. The "Go Daddy Root Certificate Authority - G2" certificate (not mentioned above, but it can be obtained in a similar way)

we need to copy their contents into a single file, which we'll call combined.cer and place in the current directory (one way to build it is sketched below). Then, simply:

import requests

res = requests.get("https://vimmi.net", verify="./combined.cer")
print (res.status_code) # 200
  • By the way, the "Go Daddy Root Certificate Authority - G2" is listed as a trusted authority by browsers and various tools; that's why we didn't have to specify it for curl.
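
One way to build combined.cer, assuming the two certificates were saved as gd_intermediate.cer and gd_root.cer (hypothetical file names; PEM blocks can simply be appended back to back):

# Concatenate the intermediate and root certificates into one bundle
with open("combined.cer", "w") as out:
    for name in ("gd_intermediate.cer", "gd_root.cer"):
        with open(name) as f:
            out.write(f.read().strip() + "\n")  # one PEM block after another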

Further reading: