BeautifulSoup timing out with certain URLs?

I'm new to BeautifulSoup and I've run into a strange issue that is probably user error, but I'm stumped! I'm using BeautifulSoup to parse a web page and return the first tag with an href attribute. When I use a Wikipedia link, it works as expected! However, when I use a BestBuy link, it results in a timeout...

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import urllib.request

# url = r"https://en.wikipedia.org/wiki/Eastern_Front_(World_War_II)"
url = r"https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"

html_content = urllib.request.urlopen(url)
soup = BeautifulSoup(html_content, 'html.parser')

link = soup.find('a', href=True)

print(link)
Traceback (most recent call last):
  File "scrapper.py", line 8, in <module>
    html_content = urllib.request.urlopen(url)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1354, in do_open
    r = h.getresponse()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
TimeoutError: [Errno 60] Operation timed out

Does anyone know why this only happens with certain URLs? Thanks in advance!

Not every website can be scraped with BeautifulSoup; some sites have restrictions. Best practice is to always send headers:

import requests
from bs4 import BeautifulSoup

# identify the request as coming from a regular browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}

url = r"https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"
# headers must be passed as a keyword argument; passed positionally,
# requests.get treats the second argument as query params instead
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   Access Denied
  </title>
 </head>
 <body>
  <h1>
   Access Denied
  </h1>
  You don't have permission to access "http://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?" on this server.
  <p>
   Reference #18.9f01d517.1595655333.b833c
  </p>
 </body>
</html>
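
Even with a browser User-Agent the site's bot protection returns Access Denied, and that is likely also the answer to the original question: a server that flags a request may simply never send a response, so urlopen blocks on the socket read until the operating system gives up (the [Errno 60] in the traceback). A minimal sketch of how to fail fast instead, using only the standard library; the 10-second timeout and the short User-Agent string here are arbitrary illustrative choices, not anything the site requires:

from urllib.request import Request, urlopen

url = "https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"

# Request lets us attach headers; the timeout makes urlopen raise
# after 10 seconds instead of hanging until the OS-level timeout
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    html_content = urlopen(req, timeout=10)
except OSError as exc:  # URLError, HTTPError, and socket timeouts are all OSError subclasses
    print(f"Request failed quickly instead of hanging: {exc}")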

You can accomplish this with Selenium. Follow the steps below:

Step 1: Download the web driver for Chrome:

First, check your Chrome version (browser menu (three vertical dots) -> Help -> About Google Chrome).

Step 2: Download the driver from here that matches your Chrome browser version (mine is 81.0.4044.138).

Step 3: After downloading, unzip the file and place chromedriver.exe in the directory where your script lives.

Step 4: pip install selenium

Now use the code below:

from selenium import webdriver
from bs4 import BeautifulSoup

# your website url
site = 'https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255'

# your driver path (the chromedriver.exe you unpacked in Step 3)
driver = webdriver.Chrome(executable_path='chromedriver.exe')

# load the page in a real browser, then hand the rendered HTML to BeautifulSoup
driver.get(site)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# quit() ends the browser session and the chromedriver process
driver.quit()

link = soup.find('a', href=True)

print(link)

Output:

<a href="https://www.bestbuy.ca/en-CA/home.aspx">
<img alt="Canada" src="https://www.bestbuy.com/~assets/bby/_intl/landing_page/images/maps/canada.svg"/>
<h4>Canada</h4>
</a>
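
One design note on the Selenium route: if you don't need to watch the browser window, Chrome can run headless, which also makes the script usable on machines without a display. A sketch assuming the same chromedriver.exe location as above; the --headless flag is a standard Chrome option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

site = 'https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255'

# run Chrome without opening a visible window
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(executable_path='chromedriver.exe', options=options)
driver.get(site)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(soup.find('a', href=True))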