How to add space around removed tags in BeautifulSoup
from BeautifulSoup import BeautifulSoup
html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>
<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''
soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)
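I have this sample code, but I can't figure out how to add spaces around the removed tags, so that the text inside <a href...> stays readable once it is formatted and doesn't come out like this: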
PoemThe RavenOnce upon a midnight dreary, while I pondered, weak and weary...
In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace
Here is an alternative that uses lxml and its xpath function to search for all text nodes:
from lxml import etree
html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>
<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''
root = etree.fromstring(html, etree.HTMLParser())
print(' '.join(root.xpath("//text()")))
It produces:
Poem The Raven Once upon a midnight dreary, while I pondered, weak and weary...
In the greenest of our valleys By good angels tenanted..., part of The Haunted Palace
One option is to find all text nodes and join them with a space:
" ".join(item.strip() for item in poems.find_all(text=True))
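For context, here is a minimal runnable sketch of that one-liner, assuming beautifulsoup4 is installed. The helper name join_text_nodes and the shortened sample HTML are made up for illustration, and string=True is used as the newer spelling of text=True. Empty pieces are filtered out so that whitespace-only nodes do not produce double spaces.

```python
from bs4 import BeautifulSoup


def join_text_nodes(html):
    """Return one space-joined string per div.thisText (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for poem in soup.find_all("div", {"class": "thisText"}):
        # Collect every text node under the div, stripped of surrounding whitespace.
        pieces = (item.strip() for item in poem.find_all(string=True))
        # Skip nodes that were whitespace only, then join the rest with one space.
        results.append(" ".join(piece for piece in pieces if piece))
    return results


# Shortened sample markup, not the original page.
sample = '''<div class="thisText">
Poem <a href="http://example.com/raven">The Raven</a>Once upon a midnight dreary... </div>'''

for line in join_text_nodes(sample):
    print(line)
```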
Also, the BeautifulSoup3 package you are using is outdated and unmaintained. Upgrade to beautifulsoup4:
pip install beautifulsoup4
and replace:
from BeautifulSoup import BeautifulSoup
with:
from bs4 import BeautifulSoup
In beautifulsoup4, get_text() has an optional argument named separator. You can use it as follows:
soup = BeautifulSoup(html, "html.parser")  # naming the parser avoids the "no parser specified" warning
text = soup.get_text(separator=' ')
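A self-contained version of the same idea might look like the sketch below, assuming beautifulsoup4 is installed. The function name text_with_spaces and the sample markup are illustrative only. Passing strip=True as well makes get_text() strip each text fragment and drop empty ones before joining, which avoids runs of spaces and stray newlines.

```python
from bs4 import BeautifulSoup


def text_with_spaces(html):
    """Extract text, putting a single space where tags were removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # separator=" " joins neighbouring text fragments with a space;
    # strip=True trims each fragment and skips whitespace-only ones.
    return soup.get_text(separator=" ", strip=True)


# Shortened sample markup, not the original page.
sample = '<div class="thisText">part of<a href="#">The Haunted Palace</a></div>'
print(text_with_spaces(sample))
```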