Edit text from HTML with BeautifulSoup

I'm currently trying to extract the HTML elements that contain text of their own and wrap that text in special tags.

For example, my HTML looks like this:

<ul class="myBodyText">
 <li class="fields">
  This text still has children
  <b>
   Simple Text
  </b>
  <div class="s">
   <ul class="section">
    <li style="padding-left: 10px;">
     Hello <br/>
     World
    </li>
   </ul>
  </div>
 </li>
</ul>

I want to wrap the tags around the text only, so that I can parse it further later on. The result should look like this:

<ul class="bodytextAttributes">
 <li class="field">
  [Editable]This text still has children[/Editable]
  <b>
   [Editable]Simple Text[/Editable]
  </b>
  <div class="sectionFields">
   <ul class="section">
    <li style="padding-left: 10px;">
     [Editable]Hello [/Editable]<br/>
     [Editable]World[/Editable]
    </li>
   </ul>
  </div>
 </li>
</ul>

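This is my script so far. It iterates over the tree fine, but the edit placeholders end up in the wrong places, and I currently have no idea how to check for that: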

def parseSection(node):
    b = str(node)
    changes = set()
    tag_start, tag_end = extractTags(b)
    # index 0 is the element itself
    for cell in node.findChildren()[1:]:
        if cell.findChildren():
            cell = parseSection(cell)
        else:
            # safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
            subtag_start, subtag_end = extractTags(str(cell))
            changes.add((str(cell), "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(subtag_start, str(cell.text), subtag_end)))

    text = extractText(b)
    for change in changes:
        text = text.replace(change[0], change[1])
    return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")

The script produces the following:

<ul class="myBodyText">
 [EditableText]
 <li class="fields">
  This text still has children
      [/EditableText]
  <b>
   [EditableText]
       Simple Text
      [/EditableText]
  </b>
  [EditableText]
  <div class="s">
   <ul class="section">
    <li style="padding-left: 10px;">
     Hello [/EditableText]
     <br/>
     [EditableText][/EditableText]
     <br/>
     [EditableText]
         World
    </li>
   </ul>
  </div>
 </li>
 [/EditableText]
</ul>

How can I check for this and fix it? I'd appreciate any answer.

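There is a built-in replace_with() method that fits this use case nicely. Instead of rebuilding the markup from strings, you search for the text nodes themselves and replace each one in place: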

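from bs4 import BeautifulSoup

# data is assumed to hold the HTML string shown in the question
soup = BeautifulSoup(data, "html.parser")

# match every text node that is not whitespace-only
for node in soup.find_all(string=lambda text: text.strip()):
    node.replace_with("[Editable]{}[/Editable]".format(node))

print(soup.prettify())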
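
Note that the snippet above wraps each text node including its leading and trailing whitespace, so the indentation and newlines end up inside the markers. If that matters, here is a minimal self-contained sketch that keeps the surrounding whitespace outside the markers. The helper name wrap_text_nodes and its marker parameter are made up for illustration, and the HTML is the sample from the question:

```python
from bs4 import BeautifulSoup

# sample markup copied from the question
HTML = """<ul class="myBodyText">
 <li class="fields">
  This text still has children
  <b>
   Simple Text
  </b>
  <div class="s">
   <ul class="section">
    <li style="padding-left: 10px;">
     Hello <br/>
     World
    </li>
   </ul>
  </div>
 </li>
</ul>"""


def wrap_text_nodes(html, marker="Editable"):
    """Hypothetical helper: wrap each non-blank text node in [marker]...[/marker]."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes inside the loop is safe
    for node in soup.find_all(string=True):
        stripped = node.strip()
        if not stripped:
            continue  # whitespace-only node between two tags
        # keep the original leading/trailing whitespace outside the markers
        lead = node[: len(node) - len(node.lstrip())]
        trail = node[len(node.rstrip()):]
        node.replace_with(
            "{0}[{1}]{2}[/{1}]{3}".format(lead, marker, stripped, trail)
        )
    return soup


wrapped = wrap_text_nodes(HTML)
print(wrapped)
```

One difference from the desired output in the question: the trailing space after "Hello" lands after the closing marker rather than before it. If the space has to stay inside, wrap the node unchanged as in the shorter snippet. Printing soup.find_all(string=True) before and after the replacement is also a quick way to check which text nodes were actually touched.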