Get text of HTML tags without text of inner child tags

Question

Example:

Sometimes the HTML is:

<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it is just:

<div id="1">
    this is the text i want here
</div>

I only want the text that sits directly inside a tag, ignoring the text of any child tags. If I use the .text attribute, I get both.

Answer 1

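Updated to use a more generic approach (see the edit history for the original answer):

You can extract the text children of the outer div by testing whether each child is an instance of NavigableString: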

from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
outer = soup.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

This produces a list of the strings contained directly in the outer div element.

>>> inner_text
[u'\n', u'\n    this is the text i want here\n']
>>> ''.join(inner_text)
u'\n\n    this is the text i want here\n'

For your second example:

html = '''<div id="1">
    this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

>>> inner_text
[u'\n    this is the text i want here\n']

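This also works in other cases: when the outer div's text comes before any child tag, between child tags, as several separate text elements, or when there is no text at all.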
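The cases listed above can be checked with a short sketch. The helper name direct_strings and the sample markup are made up for illustration and are not part of the original answer; the helper also strips whitespace and drops empty strings, which the answer's list comprehension does not do. Note that Comment is a subclass of NavigableString, so HTML comments would pass the same isinstance test.

```python
from bs4 import BeautifulSoup, NavigableString


def direct_strings(tag):
    """Return the stripped, non-empty strings that are direct children of tag.

    Hypothetical helper: same isinstance test as the answer, plus whitespace cleanup.
    """
    return [child.strip() for child in tag.children
            if isinstance(child, NavigableString) and child.strip()]


# Illustrative markup for each case mentioned in the answer.
cases = {
    "before": '<div>before<span>skip</span></div>',
    "between": '<div><b>skip</b>between<i>skip</i></div>',
    "multiple": '<div>one<b>skip</b>two<i>skip</i>three</div>',
    "none": '<div><b>skip</b></div>',
}

results = {name: direct_strings(BeautifulSoup(markup, 'html.parser').div)
           for name, markup in cases.items()}

for name, strings in results.items():
    print(name, strings)
```

In each case only the strings owned directly by the outer div are kept, and the "none" case yields an empty list rather than raising an error.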

Answer 2

Another possible approach (which I would wrap in a function):

def getText(parent):
    return ''.join(parent.find_all(text=True, recursive=False)).strip()

recursive=False means you only want direct children, not nested descendants. text=True means you only want text nodes.

Usage example:

from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
print(getText(soup.div))
# this is the text i want here
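To see what recursive=False contributes in getText, here is a small side-by-side sketch that runs the same find_all call with and without it. The variable names all_text and direct_text are illustrative only. As far as I know, newer Beautiful Soup releases (4.4.0 onward) call this argument string and keep text as an older alias; the sketch sticks with text=True to match the answer.

```python
from bs4 import BeautifulSoup

html = """<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>"""

outer = BeautifulSoup(html, 'html.parser').div

# Default (recursive=True): text nodes at every depth below outer are returned.
all_text = [s.strip() for s in outer.find_all(text=True) if s.strip()]

# recursive=False: only text nodes that are direct children of outer.
direct_text = [s.strip()
               for s in outer.find_all(text=True, recursive=False)
               if s.strip()]

print(all_text)     # includes the inner div's text
print(direct_text)  # only the outer div's own text
```

Without recursive=False the result matches what .text collects, which is the behaviour the question is trying to avoid.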

Tags: python, beautifulsoup, python-2.7