Remove tag from text with BeautifulSoup
There are many questions here with similar titles, but I'm trying to remove a tag from the soup object itself.
I have a page that contains this div:
<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>
I can select <div id="content"> with soup.find('div', id='content'), but I want to remove <div id="blah"> from it.
If you want to remove a tag or string from the tree, you can use extract.
In [14]: soup = BeautifulSoup("""<div id="content">
....: I want to keep this<br /><div id="blah">I want to remove this</div>
....: </div>""")
In [15]: blah = soup.find(id='blah')
In [16]: _ = blah.extract()
In [17]: soup
Out[17]:
<html><body><div id="content">
I want to keep this<br/>
</div></body></html>
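As a minimal sketch of the same idea outside an interactive session: extract() detaches the tag from the tree and returns it, so the removed piece can still be inspected afterwards. The variable names below are illustrative, and the choice of the "html.parser" backend is an assumption (it does not add the <html>/<body> wrappers seen in the output above).

```python
from bs4 import BeautifulSoup

html = """<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>"""

# Assumption: use the stdlib-backed parser so no extra dependency is needed.
soup = BeautifulSoup(html, "html.parser")

content = soup.find("div", id="content")

# extract() removes the tag from the tree and hands it back to the caller.
removed = content.find("div", id="blah").extract()

remaining_text = content.get_text(strip=True)
removed_text = removed.get_text()

print(remaining_text)
print(removed_text)
```

If the removed tag is not needed afterwards, the return value can simply be discarded.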
The Tag.decompose method removes a tag from the tree and destroys it along with its contents.
So find the div tag:
div = soup.find('div', {'id':'content'})
Loop over all the children except the first:
for child in list(div)[1:]:
and try to decompose each child:
    try:
        child.decompose()
    except AttributeError: pass
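import bs4 as bs
content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
# Name the parser explicitly to avoid the "no parser specified" warning
soup = bs.BeautifulSoup(content, 'html.parser')
div = soup.find('div', {'id':'content'})
# Copy the children into a list so the tree can change while looping
for child in list(div)[1:]:
    try:
        child.decompose()
    except AttributeError:
        # text nodes may not support decompose()
        pass
print(div)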
yields
<div id="content">
I want to keep this
</div>
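A variation on the loop above, offered as a sketch rather than as the answer's own method: instead of skipping the first child by position and catching AttributeError, check each child's type and decompose only the Tag nodes. This assumes every direct child tag of the div should go, which holds for the sample markup but may not hold for a real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>"""

# Assumption: the stdlib-backed parser is sufficient for this markup.
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "content"})

# Copy the children into a list first, because decompose() mutates the tree.
for child in list(div.children):
    if isinstance(child, Tag):  # leave plain text nodes untouched
        child.decompose()

kept_text = div.get_text(strip=True)
print(div)
```

The type check avoids depending on which node classes happen to expose decompose() in a given Beautiful Soup release.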
The equivalent using lxml is
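import lxml.html as LH
content = '''<div id="content">
I want to keep this<br /><div id="blah">I want to remove this</div>
</div>'''
root = LH.fromstring(content)
div = root.xpath('//div[@id="content"]')[0]
# Iterate over a copy so removal does not disturb the loop
for child in list(div):
    # remove() also drops the child's tail text
    div.remove(child)
# encoding='unicode' returns str instead of bytes
print(LH.tostring(div, encoding='unicode'))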