Python

Question

HTML content

I have a web page to parse. The HTML code is as shown:

I need to extract the price, which is plain text:

<div class="price">
"212,25 € "
<sup>HT</sup>

This is the only "price" class on the page, so I call the find() method:

soup = BeautifulSoup(get(url, headers=headers, params=params).content, 'lxml')
container = soup.find_all('div', class_="side-content") # Find a container
cost = container.find('div', {'class': 'price'}) # Find price class
cost_value = cost.next_sibling

cost is None. I have tried both .next_sibling and .text, but because find() returns None, I get an exception. How can I fix this?

Answer 1

The trick here is:

cost = cost.find(text=True).strip()

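We find() the first text node and strip() the surrounding whitespace.

find(text=True) returns the first text node inside the <div>. That node comes before the nested <sup>, so the text of the <sup> is not included.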
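To make the difference concrete, here is a minimal sketch comparing the first text node with get_text(), which collects every text node including the one inside <sup>. The HTML string is a hypothetical stand-in for the real page, and string=True is the spelling that newer Beautiful Soup versions use for text=True.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = '<div class="price"> 212,25 € <sup>HT</sup></div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="price")

# First text node only: the price that precedes <sup>.
first_text = div.find(string=True).strip()

# Every text node under the <div>, each stripped and then joined.
all_text = div.get_text(strip=True)

print(first_text)
print(all_text)
```

Here first_text holds only the price, while all_text also picks up the "HT" from the <sup>.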

Regarding the container:

This is the only "price" class on the page

Then why bother with it? Just search for the price directly:

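from bs4 import BeautifulSoup

# The quotes around the price in the question are assumed to be the browser
# inspector's way of marking a text node, so they are left out of the HTML here.
html = """
<div class="price">
    212,25 €
    <sup>HT</sup>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Search for the price <div> directly, then take its first text node.
cost = soup.find('div', {'class': 'price'})
cost = cost.find(text=True).strip()

print(cost)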

This will output:

212,25 €

Answer 2

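I have solved it. The problem was that the data is generated by JavaScript, so static parsing does not work on it. I tried several solutions (including Selenium and capturing the results of the XHR script).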

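In the end, I found in the data I had parsed a static URL that links to a separate web page. That page is where the JavaScript code is executed, and it can be parsed with static methods.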

The video "Python Web Scraping Tutorial: scraping dynamic JavaScript/Ajax websites with Beautiful Soup" explains a similar solution.
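One way to notice this situation early is to check whether the price element exists in the static HTML at all before reading its text. The sketch below is only an illustration: extract_price, static_page and dynamic_page are hypothetical names, and the two HTML strings stand in for a server-rendered page and a JavaScript-rendered one.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return the price text, or None if the static HTML has no price <div>."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="price")
    if div is None:
        # The element is missing from the static markup; it may be
        # inserted later by JavaScript running in the browser.
        return None
    text = div.find(string=True)
    return text.strip() if text is not None else None


# Hypothetical pages: one rendered on the server, one filled in by a script.
static_page = '<div class="price"> 212,25 € <sup>HT</sup></div>'
dynamic_page = '<div id="app"></div><script>/* price rendered here */</script>'

print(extract_price(static_page))
print(extract_price(dynamic_page))
```

A None result for a page that shows a price in the browser suggests the value is rendered client-side, which is the case described above.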

Python - Beautiful Soup - extract text between <div> and <sup>

html

beautifulsoup