Python web crawling is throwing connection errors
I have the following Python code for web scraping, and when I try to run it, it throws the error below. Code:
import lxml.html
import requests
from bs4 import BeautifulSoup

url1 = 'http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;'
url2 = 'page='
url3 = 'size=200;template=results;type=batting'
url5 = ['http://stats.espncricinfo.com/ci/engine/stats/index.html?class=11;filter=advanced;orderby=runs;size=200;template=results;type=batting']

for i in range(2, 3854):
    url4 = url1 + url2 + str(i) + ';' + url3
    url5.append(url4)

for page in url5:
    source_code = requests.get(page, verify=False)
    # just get the code, no headers or anything
    plain_text = source_code.text
    # BeautifulSoup objects can be sorted through easy
    soup = BeautifulSoup(plain_text, "lxml")
    for link in soup.findAll('a', {'class': 'data-link'}):
        href = "https://www.espncricinfo.com" + link.get('href')
        title = link.string  # just the text, not the HTML
        source_code = requests.get(href)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "lxml")
        # if you want to gather information from that page
        for item_name in soup.findAll('span', {'class': 'ciPlayerinformationtxt'}):
            print(item_name.string)
Error:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 559, in urlopen
body=body, headers=headers)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 345, in _make_request
self._validate_conn(conn)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 782, in _validate_conn
conn.connect()
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connection.py", line 266, in connect
match_hostname(cert, self.assert_hostname or hostname)
File "C:\Python34\lib\ssl.py", line 285, in match_hostname
% (hostname, ', '.join(map(repr, dnsnames))))
ssl.CertificateError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 369, in send
timeout=timeout
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\packages\urllib3\connectionpool.py", line 588, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Python34/intplayername.py", line 23, in <module>
source_code = requests.get(href)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 69, in get
return request('get', url, params=params, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\api.py", line 50, in request
response = session.request(method=method, url=url, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 471, in request
resp = self.send(prep, **send_kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\sessions.py", line 579, in send
r = adapter.send(request, **kwargs)
File "C:\Python34\lib\site-packages\requests-2.8.0-py3.4.egg\requests\adapters.py", line 430, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: hostname 'www.espncricinfo.com' doesn't match either of 'a248.e.akamai.net', '*.akamaihd.net', '*.akamaihd-staging.net', '*.akamaized.net', '*.akamaized-staging.net'
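The last line of the traceback says that none of the names in the certificate the server presented cover the host that was requested. As a rough illustration of that check, here is a simplified sketch (not the actual `ssl.match_hostname` implementation) in which `*` may stand only for the whole leftmost label. The helper name `name_matches` is made up for this example; the certificate names are copied from the error message.

```python
def name_matches(hostname, pattern):
    """Simplified wildcard check: '*' may replace the leftmost label only."""
    host_labels = hostname.lower().split(".")
    pattern_labels = pattern.lower().split(".")
    # A wildcard covers exactly one label, so the label counts must agree.
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] != "*" and pattern_labels[0] != host_labels[0]:
        return False
    # Everything after the leftmost label must match exactly.
    return pattern_labels[1:] == host_labels[1:]


# Names listed in the certificate, taken from the error message.
CERT_NAMES = [
    "a248.e.akamai.net",
    "*.akamaihd.net",
    "*.akamaihd-staging.net",
    "*.akamaized.net",
    "*.akamaized-staging.net",
]

matches = [name for name in CERT_NAMES if name_matches("www.espncricinfo.com", name)]
print(matches)  # → []
```

No name matches, which is why the TLS handshake is rejected before any page content is fetched.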
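This happens because the HTTPS certificate on the site you are scraping is misconfigured: the server presents a certificate issued for Akamai hostnames, which does not cover www.espncricinfo.com. Your first requests.get call already passes verify=False, but the one that fails (line 23 in the traceback) does not. As a workaround, you can turn off certificate verification in the requests library for that call as well: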
requests.get(href, verify=False)
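Note that this is not recommended when you are handling sensitive information, because with verification off the connection is no longer protected against man-in-the-middle attacks.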
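If you would rather not disable verification everywhere, one option is to keep it on by default and fall back to an unverified request only for the host known to be misconfigured. The following is a minimal sketch under some assumptions: the names `fetch`, `make_session`, and `INSECURE_HOSTS` are made up for this example, the 10-second timeout is arbitrary, and it presumes a reasonably recent requests release where urllib3 is installed as a separate package (unlike the vendored copy in requests 2.8.0 from the traceback).

```python
from urllib.parse import urlsplit

import requests
import urllib3

# Assumption: this is the only host whose certificate is misconfigured.
INSECURE_HOSTS = {"www.espncricinfo.com"}


def make_session():
    """Create a session and silence the warning that verify=False triggers."""
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return requests.Session()


def fetch(session, url, timeout=10):
    """GET url with verification on; retry unverified only for INSECURE_HOSTS."""
    try:
        return session.get(url, timeout=timeout)
    except requests.exceptions.SSLError:
        if urlsplit(url).hostname not in INSECURE_HOSTS:
            raise  # unexpected certificate problem: do not hide it
        return session.get(url, timeout=timeout, verify=False)
```

Calling `fetch(make_session(), href)` in place of `requests.get(href)` would keep certificate checks active for every other host, so a certificate problem elsewhere still surfaces as an error.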