beautifulsoup .get_text() is not specific enough for my HTML parsing

Question

Given the HTML below, I want to output only the h1's own text, without "Details about ", which is the text of the span nested inside the h1.

My current output is:

Details about   New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

I would like:

New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black

This is the HTML I am working with:

<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>

This is my current code:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    print(line.get_text())

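Note: I don't want to simply truncate the string, because I'd like this code to be reasonably reusable. Ideally, the code would strip out any text enclosed by the span.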

Answer 1

One solution is to check whether each child, when converted to a string, contains HTML:

from bs4 import BeautifulSoup

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Re-parse the child; find() returns a tag only if the child is markup
        if BeautifulSoup(str(content), "html.parser").find() is not None:
            continue

        # Plain text sitting directly inside the <h1>
        print(content)

Another solution (which I prefer) is to check whether each child is an instance of bs4.element.Tag:

import bs4

html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = bs4.BeautifulSoup(html, 'html.parser')

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    for content in line.contents:
        # Skip child tags such as the <span>; keep only bare text nodes
        if isinstance(content, bs4.element.Tag):
            continue

        print(content)
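
Since the question asks for something reusable, the same idea can be wrapped in a small helper. This is only a sketch: the name own_text is made up for illustration and is not part of BeautifulSoup. It keeps the direct children that are NavigableString instances, which is the mirror image of skipping Tag instances above. Note that HTML comments are also NavigableString subclasses, so they would be included as well.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def own_text(tag):
    """Hypothetical helper: join the text nodes that are direct children
    of `tag`, ignoring any text inside nested tags."""
    parts = [child for child in tag.contents
             if isinstance(child, NavigableString)]
    return "".join(parts).strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")

for h1 in soup.find_all("h1", attrs={"itemprop": "name"}):
    print(own_text(h1))
```

Because the helper only looks at tag.contents, it does not depend on the nested tag being a span; text inside any child tag is skipped.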

Answer 2

You can use extract() to remove all span tags:

for line in soup.find_all('h1', attrs={'itemprop': 'name'}):
    # Remove every <span> from the heading (this modifies the soup in place)
    for s in line('span'):
        s.extract()
    print(line.get_text())
# => New Men's Genuine Leather Bifold ID Credit Card Money Holder Wallet Black
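
One thing to keep in mind is that extract() changes the parsed tree, so the span is gone for any later code that uses the same soup. If that matters, a possible variant is to work on a copy of the tag. The sketch below assumes a BeautifulSoup version that supports copy.copy() on tags (4.4.0 or later, as far as I know); the function name text_without is hypothetical.

```python
import copy

from bs4 import BeautifulSoup


def text_without(tag, name):
    """Hypothetical helper: return the text of `tag` with every nested
    `name` tag removed, leaving the original tree untouched."""
    clone = copy.copy(tag)          # detached copy of the tag and its children
    for child in clone.find_all(name):
        child.extract()             # only the copy is modified
    return clone.get_text().strip()


html = """<h1 class="it-ttl" itemprop="name" id="itemTitle"><span class="g-hdn">Details about  &nbsp;</span>New Men&#039;s Genuine Leather Bifold ID Credit Card Money Holder Wallet Black</h1>"""
soup = BeautifulSoup(html, "html.parser")

for h1 in soup.find_all("h1", attrs={"itemprop": "name"}):
    print(text_without(h1, "span"))

# The span is still present in the original soup
print(soup.find("span") is not None)
```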
