How do I delete a tag from a node in lxml without deleting its tail?

Question

Example:

html = <a><b>Text</b>Text2</a>

BeautifulSoup code:

[x.extract() for x in html.findAll('b')]

The output is:

html = <a>Text2</a>

lxml code:

[bad.getparent().remove(bad) for bad in html.xpath(".//b")]

The output is:

html = <a></a>

because lxml treats "Text2" as the tail of <b></b>, so removing the element removes the tail along with it.
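To make the text/tail distinction concrete, here is a minimal sketch using lxml.etree on the same markup. The variable names are only illustrative. Note that etree serializes the empty element as <a/>, whereas the HTML serializer would print <a></a>.

```python
from lxml import etree

# Same markup as in the question, parsed as XML for simplicity.
root = etree.fromstring("<a><b>Text</b>Text2</a>")
b = root.find("b")

b_text = b.text        # text inside <b>: "Text"
b_tail = b.tail        # text after </b>, stored on <b> itself: "Text2"
root_text = root.text  # nothing precedes <b> inside <a>, so this is None

# Removing <b> through its parent takes the tail away with it.
root.remove(b)
result = etree.tostring(root, encoding="unicode")
print(result)
```

Because "Text2" is stored on the <b> element rather than on <a>, it disappears as soon as <b> is removed.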

If we only need the joined text of the node without the text inside those tags, we can use:

for bad in raw.xpath(xpath_search):
    bad.text = ''

But how can I remove the tags themselves, without their tails and without changing the rest of the text?

Answer 1

Edit:

Please see @Joshmakers' answer, which is clearly the better one.

I did the following to save the tail text to the previous sibling or the parent.

def remove_keeping_tail(self, element):
    """Safe the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail: # preserve the tail
        previous = node.getprevious()
        if previous is not None: # if there is a previous sibling it will get the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else: # no previous sibling: the parent gets the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
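The two methods above take self and presumably belong to a class that is not shown. As a hedged sketch, the same idea can be written as one standalone function. The sample markup with "Head", <c/> and "Text3" is made up for illustration and is not from the question.

```python
from lxml import etree


def remove_keeping_tail(element):
    """Remove element from its parent but keep its tail text in the tree."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # A previous sibling exists: append the tail to that sibling's tail.
            previous.tail = (previous.tail or "") + element.tail
        else:
            # First child: append the tail to the parent's text instead.
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


# Hypothetical sample input, slightly richer than the one in the question.
root = etree.fromstring("<a>Head<b>Text</b>Text2<c/>Text3</a>")
for bad in root.xpath(".//b"):
    remove_keeping_tail(bad)
result = etree.tostring(root, encoding="unicode")
print(result)
```

Here <b> is the first child, so its tail is appended to the text of <a>, and the later sibling <c/> keeps its own tail untouched.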

HTH

Answer 2

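While the accepted answer from phlou will work, there are easier ways to remove tags without also removing their tails.

If you want to remove a specific element, the lxml method you are looking for is drop_tree.

From the docs: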

Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.

If you want to remove all instances of a specific tag, you can use lxml.etree.strip_elements or lxml.html.etree.strip_elements with with_tail=False.

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

So, for the example in the original post:

>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
...    bad.drop_tree()
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'

or

>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'
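For contrast, the following sketch runs strip_elements twice on copies of the same tree, once with the default with_tail=True and once with with_tail=False. It uses lxml.etree directly, so the empty element is serialized as <a/>. The variable names are illustrative only.

```python
from copy import deepcopy

from lxml import etree

source = etree.fromstring("<a><b>Text</b>Text2</a>")

# Default behaviour (with_tail=True): the tail "Text2" is removed with <b>.
default_tree = deepcopy(source)
etree.strip_elements(default_tree, "b")
default_result = etree.tostring(default_tree, encoding="unicode")

# with_tail=False: <b> and its content go away, the tail stays in <a>.
kept_tree = deepcopy(source)
etree.strip_elements(kept_tree, "b", with_tail=False)
kept_result = etree.tostring(kept_tree, encoding="unicode")

print(default_result)
print(kept_result)
```

The first call reproduces the problem from the question, while the second keeps "Text2" in place.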
