BeautifulSoup timing out with certain URLs?
I'm new to BeautifulSoup and I've run into a strange issue that's probably user error, but I'm stumped! I'm using BeautifulSoup to parse a webpage and return the first tag with an href attribute. When I use a Wikipedia link it works as expected! However, when I use a BestBuy link it times out...
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import urllib.request
# url = r"https://en.wikipedia.org/wiki/Eastern_Front_(World_War_II)"
url = r"https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"
html_content = urllib.request.urlopen(url)
soup = BeautifulSoup(html_content, 'html.parser')
link = soup.find('a', href=True)
print(link)
Traceback (most recent call last):
File "scrapper.py", line 8, in <module>
html_content = urllib.request.urlopen(url)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 525, in open
response = self._open(req, data)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 542, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1393, in https_open
return self.do_open(http.client.HTTPSConnection, req,
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1354, in do_open
r = h.getresponse()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
TimeoutError: [Errno 60] Operation timed out
Does anyone know why this happens only with certain URLs? Thanks in advance!
You can't scrape every website with BeautifulSoup alone; some sites restrict automated requests. As a best practice, always send request headers:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
url = r"https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"

# Pass headers as a keyword argument; a bare second positional
# argument would be treated as query params, not headers.
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())
Output:
<html>
<head>
<title>
Access Denied
</title>
</head>
<body>
<h1>
Access Denied
</h1>
You don't have permission to access "http://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?" on this server.
<p>
Reference #18.9f01d517.1595655333.b833c
</p>
</body>
</html>
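So even with a header the site refuses the request, but at least it answers instead of hanging. If you'd rather stay with your original urllib code, the same header idea applies, and an explicit timeout makes a blocked request fail fast instead of stalling for minutes. A minimal sketch, assuming a generic 'Mozilla/5.0' User-Agent and an arbitrary 10-second timeout:

import urllib.error
import urllib.request

url = "https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"

# Attach a User-Agent and set an explicit timeout so a blocked request
# raises quickly instead of waiting on the default socket timeout.
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # header value is a placeholder
try:
    response = urllib.request.urlopen(req, timeout=10)
    print(response.status)
except (urllib.error.URLError, TimeoutError) as exc:
    print(f"Request failed: {exc}")

Expect a similar denial here; the point is only that the failure becomes visible rather than a hang.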
You can do this with selenium; follow the steps below:
Step 1: Download the web driver for Chrome:
First check your Chrome version (browser menu (three vertical dots) -> Help -> About Google Chrome)
Step 2: Download the driver matching your Chrome browser version (mine is 81.0.4044.138) from here
Step 3: Unzip the download and place chromedriver.exe in the same directory as your script.
Step 4: pip install selenium
Now use the code below:
from selenium import webdriver
from bs4 import BeautifulSoup

# your website url
site = 'https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255'

# your driver path (chromedriver.exe sits next to this script)
driver = webdriver.Chrome(executable_path='chromedriver.exe')

# load the page in a real browser, then hand the rendered HTML to BeautifulSoup
driver.get(site)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # quit() ends the whole session; close() would only close the window

link = soup.find('a', href=True)
print(link)
Output:
<a href="https://www.bestbuy.ca/en-CA/home.aspx">
<img alt="Canada" src="https://www.bestbuy.com/~assets/bby/_intl/landing_page/images/maps/canada.svg"/>
<h4>Canada</h4>
</a>
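Note that newer Selenium releases (4.x) removed the executable_path argument; there you pass the driver path through a Service object instead. A minimal sketch under that assumption:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Selenium 4.x: wrap the chromedriver path in a Service; executable_path is gone.
service = Service('chromedriver.exe')  # assumes the driver sits next to the script
driver = webdriver.Chrome(service=service)

driver.get('https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255')
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(soup.find('a', href=True))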