BeautifulSoup timing out with certain URLs?

I'm new to BeautifulSoup and I've run into a strange issue that is probably user error, but I'm stumped! I'm using BeautifulSoup to parse a web page and return the first tag with an href attribute. When I use a Wikipedia link, it works as expected! However, when I use a BestBuy link, it results in a timeout...

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import urllib.request

# url = r"https://en.wikipedia.org/wiki/Eastern_Front_(World_War_II)"
url = r"https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"

html_content = urllib.request.urlopen(url)
soup = BeautifulSoup(html_content, 'html.parser')

link = soup.find('a', href=True)

print(link)
Traceback (most recent call last):
  File "scrapper.py", line 8, in <module>
    html_content = urllib.request.urlopen(url)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1354, in do_open
    r = h.getresponse()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
TimeoutError: [Errno 60] Operation timed out

Does anyone know why this only happens with certain URLs? Thanks in advance!

Not every website can be scraped with BeautifulSoup; some sites have restrictions. Best practice is to always send headers:

import requests
from bs4 import BeautifulSoup

# identify the request as coming from a regular browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}

url = r"https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"
# headers must be passed as a keyword argument; passed positionally,
# requests.get treats the second argument as query params instead
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   Access Denied
  </title>
 </head>
 <body>
  <h1>
   Access Denied
  </h1>
  You don't have permission to access "http://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?" on this server.
  <p>
   Reference #18.9f01d517.1595655333.b833c
  </p>
 </body>
</html>
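
Even with a browser User-Agent the site's bot protection returns Access Denied, and that is likely also the answer to the original question: a server that flags a request may simply never send a response, so urlopen blocks on the socket read until the operating system gives up (the [Errno 60] in the traceback). A minimal sketch of how to fail fast instead, using only the standard library; the 10-second timeout and the short User-Agent string here are arbitrary illustrative choices, not anything the site requires:

from urllib.request import Request, urlopen

url = "https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255"

# Request lets us attach headers; the timeout makes urlopen raise
# after 10 seconds instead of hanging until the OS-level timeout
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    html_content = urlopen(req, timeout=10)
except OSError as exc:  # URLError, HTTPError, and socket timeouts are all OSError subclasses
    print(f"Request failed quickly instead of hanging: {exc}")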

You can accomplish this with Selenium. Follow the steps below:

Step 1: Download the web driver for Chrome:

First, check your Chrome version (browser menu (three vertical dots) -> Help -> About Google Chrome).

Step 2: Download the driver from here that matches your Chrome browser version (mine is 81.0.4044.138).

Step 3: After downloading, unzip the file and place chromedriver.exe in the directory where your script lives.

Step 4: pip install selenium

Now use the code below:

from selenium import webdriver
from bs4 import BeautifulSoup

# your website url
site = 'https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255'

# your driver path (the chromedriver.exe you unpacked in Step 3)
driver = webdriver.Chrome(executable_path='chromedriver.exe')

# load the page in a real browser, then hand the rendered HTML to BeautifulSoup
driver.get(site)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# quit() ends the browser session and the chromedriver process
driver.quit()

link = soup.find('a', href=True)

print(link)

Output:

<a href="https://www.bestbuy.ca/en-CA/home.aspx">
<img alt="Canada" src="https://www.bestbuy.com/~assets/bby/_intl/landing_page/images/maps/canada.svg"/>
<h4>Canada</h4>
</a>
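
One design note on the Selenium route: if you don't need to watch the browser window, Chrome can run headless, which also makes the script usable on machines without a display. A sketch assuming the same chromedriver.exe location as above; the --headless flag is a standard Chrome option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

site = 'https://www.bestbuy.com/site/nintendo-switch-32gb-console-neon-red-neon-blue-joy-con/6364255.p?skuId=6364255'

# run Chrome without opening a visible window
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(executable_path='chromedriver.exe', options=options)
driver.get(site)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(soup.find('a', href=True))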