Scraping a website in Python with the requests-html library: not all elements are returned when selecting with BeautifulSoup

Question

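I am trying to scrape https://edition.cnn.com/world with the Python snippet below. The problem is that when I parse the content with BeautifulSoup, I don't get all the data I want. I get 20 or so elements, but there are many more items that should be selected.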

from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs

url = "https://edition.cnn.com/world"
s = HTMLSession()
response = s.get(url)
response.html.render(wait=20)
soup = bs(response.content, 'html.parser')
results = soup.select('div.cd__wrapper')
print(len(results))  # returns 20 or so

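Basically I should use Selenium, but since this is not the only site I need to scrape, that could become cumbersome. Apparently the site runs some JavaScript while loading, which causes this problem. I would like to know what tweak would work here, or whether this can be done without being forced to use Selenium.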

Answer 1

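That happens because whichever library or module you use to extract the HTML tags may not be picking up all of them. Unfortunately, I can't tell which is the case without running your code. Either: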

1.) the tags are in an array, so you have to enumerate them

or

2.) there is a problem with BeautifulSoup or HTMLSession

Try using from urllib.request import urlopen as uReq

Usage example:

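from urllib.request import urlopen as uReq

xClient = uReq(YOUR_URL)  # YOUR_URL is a placeholder for the page address
Raw_html = xClient.read()  # raw bytes of the response body
xClient.close()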

Make sure to close the connection after use.
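One way to tell whether the markup fetched this way already contains the elements is to count them in the raw HTML, before any JavaScript has run. The following is a minimal sketch using only the standard library; `ClassCounter`, `count_elements`, and the sample markup are hypothetical names made up for illustration, and the `div.cd__wrapper` target is taken from the question.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_elements(html, tag="div", css_class="cd__wrapper"):
    """Return how many <tag> elements in html include css_class."""
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    return parser.count


# Hypothetical sample markup standing in for a downloaded page.
sample = (
    '<div class="cd__wrapper">a</div>'
    '<div class="cd__wrapper extra">b</div>'
    '<div class="other">c</div>'
)
print(count_elements(sample))  # 2
```

Passing the decoded bytes returned by `read()` to `count_elements` gives the count for the raw download. If that number is far below what the browser shows, the missing items are most likely added by JavaScript after the initial response.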

Answer 2

I'm afraid that finding a new tweak for every new page, rather than simply using Selenium to get the HTML, will be the more cumbersome route.

In principle, you could issue the individual requests that the corresponding container-managers make yourself:

<script>CNN.covCon.push({id: "coverageContainer_8DDF4E26-8632-6418-1586-B910547ED120",layout: "list-hierarchical-xs",src: "/data/ocs/container/coverageContainer_8DDF4E26-8632-6418-1586-B910547ED120:list-hierarchical-xs/views/containers/common/container-manager.html"});</script>

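Making those requests separately would let you avoid Selenium, but you would then have to work out this kind of adjustment for every other page as well, which takes time and is not stable at all.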
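To illustrate what such a tweak might look like, here is a rough sketch that pulls the `src` paths out of `CNN.covCon.push({...})` scripts and turns them into absolute URLs. It assumes the inline scripts keep the shape shown above; `SRC_PATTERN` and `container_urls` are hypothetical names, and what each URL actually returns (and whether it contains the `cd__wrapper` markup) is not verified here.

```python
import re
from urllib.parse import urljoin

# Matches the src value inside CNN.covCon.push({... src: "..." ...}) calls.
# Assumption: the inline scripts keep the shape shown in the answer.
SRC_PATTERN = re.compile(r'CNN\.covCon\.push\(\{.*?src:\s*"([^"]+)"', re.DOTALL)


def container_urls(page_html, base_url="https://edition.cnn.com/world"):
    """Return absolute URLs for every container src found in page_html."""
    return [urljoin(base_url, src) for src in SRC_PATTERN.findall(page_html)]


# The script tag quoted in the answer, used here as sample input.
sample_script = (
    '<script>CNN.covCon.push({id: "coverageContainer_8DDF4E26-8632-6418-1586-'
    'B910547ED120",layout: "list-hierarchical-xs",src: "/data/ocs/container/'
    'coverageContainer_8DDF4E26-8632-6418-1586-B910547ED120:list-hierarchical-xs'
    '/views/containers/common/container-manager.html"});</script>'
)

for url in container_urls(sample_script):
    print(url)
```

Each resulting URL could then be fetched and parsed on its own. The regular expression is tied to this one site's markup, which is exactly the fragility described above.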

For what it's worth, Selenium does not take that much effort, and you can still process the HTML with BeautifulSoup:

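from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Raw string so the backslashes in the Windows path are not treated as escapes
service = Service(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get('https://edition.cnn.com/world')

# page_source holds the DOM after the browser has executed the page's JavaScript
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(len(soup.select('.cd__wrapper')))

driver.quit()  # close the browser when done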

Output --> 116
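Separately from the Selenium route, one detail in the question's requests-html snippet may be worth checking: after `response.html.render()`, the rendered markup lives in `response.html.html`, while `response.content` still holds the bytes of the original, unrendered response. The sketch below parses the rendered markup instead. `fetch_rendered_html` is a hypothetical helper name, and whether this alone yields all the items on this particular page is an assumption that has not been verified here.

```python
def fetch_rendered_html(session, url, wait=20, sleep=0):
    """Fetch url, run its JavaScript, and return the rendered markup.

    `session` is expected to be a requests_html.HTMLSession, or any object
    offering the same get() / response.html.render() interface.
    """
    response = session.get(url)
    # render() executes the page's JavaScript in headless Chromium and
    # updates response.html in place; response.content is left untouched.
    response.html.render(wait=wait, sleep=sleep)
    return response.html.html


# Usage (needs requests-html and bs4; Chromium is downloaded on first render):
#   from requests_html import HTMLSession
#   from bs4 import BeautifulSoup
#   html = fetch_rendered_html(HTMLSession(), "https://edition.cnn.com/world")
#   print(len(BeautifulSoup(html, "html.parser").select("div.cd__wrapper")))
```

Taking the session as a parameter keeps the helper reusable across sites and easy to exercise with a stub. Raising `sleep` gives late-loading scripts more time after the initial render, at the cost of slower runs.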


python

beautifulsoup

python-requests