使用 BeautifulSoup 编辑来自 html 的文本
Edit text from html with BeautifulSoup
我目前正在尝试提取自己有文本的 html 元素,并用特殊标签将它们包装起来。
例如,我的 HTML 看起来像这样:
<ul class="myBodyText">
<li class="fields">
This text still has children
<b>
Simple Text
</b>
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello <br/>
World
</li>
</ul>
</div>
</li>
</ul>
我试图只在标签周围包装标签,这样我可以在以后进一步解析它们,所以我试着让它看起来像这样:
<ul class="bodytextAttributes">
<li class="field">
[Editable]This text still has children[/Editable]
<b>
[Editable]Simple Text[/Editable]
</b>
<div class="sectionFields">
<ul class="section">
<li style="padding-left: 10px;">
[Editable]Hello [/Editable]<br/>
[Editable]World[/Editable]
</li>
</ul>
</div>
</li>
</ul>
到目前为止,我的脚本迭代得很好,但编辑占位符的位置不起作用,我目前不知道如何检查:
def parseSection(node):
b = str(node)
changes = set()
tag_start, tag_end = extractTags(b)
# index 0 is the element itself
for cell in node.findChildren()[1:]:
if cell.findChildren():
cell = parseSection(cell)
else:
# safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
subtag_start, subtag_end = extractTags(str(cell))
changes.add((str(cell), "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(subtag_start, str(cell.text), subtag_end)))
text = extractText(b)
for change in changes:
text = text.replace(change[0], change[1])
return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")
脚本生成以下内容:
<ul class="myBodyText">
[EditableText]
<li class="fields">
This text still has children
[/EditableText]
<b>
[EditableText]
Simple Text
[/EditableText]
</b>
[EditableText]
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello [/EditableText]
<br/>
[EditableText][/EditableText]
<br/>
[EditableText]
World
</li>
</ul>
</div>
</li>
[/EditableText]
</ul>
我如何检查并修复它?我很感激每一个可能的答案。
有一个内置的 replace_with()
方法非常适合用例:
soup = BeautifulSoup(data)
for node in soup.find_all(text=lambda x: x.strip()):
node.replace_with("[Editable]{}[/Editable]".format(node))
print soup.prettify()
我目前正在尝试提取自己有文本的 html 元素,并用特殊标签将它们包装起来。
例如,我的 HTML 看起来像这样:
<ul class="myBodyText">
<li class="fields">
This text still has children
<b>
Simple Text
</b>
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello <br/>
World
</li>
</ul>
</div>
</li>
</ul>
我试图只在标签周围包装标签,这样我可以在以后进一步解析它们,所以我试着让它看起来像这样:
<ul class="bodytextAttributes">
<li class="field">
[Editable]This text still has children[/Editable]
<b>
[Editable]Simple Text[/Editable]
</b>
<div class="sectionFields">
<ul class="section">
<li style="padding-left: 10px;">
[Editable]Hello [/Editable]<br/>
[Editable]World[/Editable]
</li>
</ul>
</div>
</li>
</ul>
到目前为止,我的脚本迭代得很好,但编辑占位符的位置不起作用,我目前不知道如何检查:
def parseSection(node):
b = str(node)
changes = set()
tag_start, tag_end = extractTags(b)
# index 0 is the element itself
for cell in node.findChildren()[1:]:
if cell.findChildren():
cell = parseSection(cell)
else:
# safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
subtag_start, subtag_end = extractTags(str(cell))
changes.add((str(cell), "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(subtag_start, str(cell.text), subtag_end)))
text = extractText(b)
for change in changes:
text = text.replace(change[0], change[1])
return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")
脚本生成以下内容:
<ul class="myBodyText">
[EditableText]
<li class="fields">
This text still has children
[/EditableText]
<b>
[EditableText]
Simple Text
[/EditableText]
</b>
[EditableText]
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello [/EditableText]
<br/>
[EditableText][/EditableText]
<br/>
[EditableText]
World
</li>
</ul>
</div>
</li>
[/EditableText]
</ul>
我如何检查并修复它?我很感激每一个可能的答案。
有一个内置的 replace_with()
方法非常适合用例:
soup = BeautifulSoup(data)
for node in soup.find_all(text=lambda x: x.strip()):
node.replace_with("[Editable]{}[/Editable]".format(node))
print soup.prettify()