Is it possible to treat text as an XML element with lxml?

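I want to filter an element tree to remove redundant element entries. In short, I am trying to clean up XML output into something that a different tool can parse.

For example: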

<p>
  <p>
    Text node 1
    <ul>
      <li>asdasd</li>
    </ul>  
    <p>
      Text node 2 <span>Som text</span>
    </p>
    Text node 3
  </p>
  <p>Text node 4</p>
</p>

would be converted to:

<p>
  Text node 1
  <ul>
  <li>asdasd</li>
  </ul>
</p>
<p>Text node 2 <span>Som text</span></p>
<p>Text node 3</p>
<p>Text node 4</p>

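In lxml, getchildren seems to return only XML elements. So when I call getchildren on the p that contains the ul, it returns a list like [ul, p], whereas I want a list containing:

[Text, Ul, P, Text] so that I can easily walk down or up the tree and remove the redundant elements.

The lxml documentation says that it has no text nodes: text is either part of the element, accessed through .text, or the tail that follows the closing tag, accessed through .tail.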

<html><body>Hello<br/>World</body></html>

Here, the <br/> tag is surrounded by text. This is often referred to as document-style or mixed-content XML. Elements support this through their tail property. It contains the text that directly follows the element, up to the next element in the XML tree.

The two properties .text and .tail are enough to represent any text content in an XML document. This way, the ElementTree API does not require any special text nodes in addition to the Element class, that tend to get in the way fairly often (as you might know from classic DOM APIs).
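
As a quick check of the quoted passage, here is a minimal sketch that parses the same snippet and prints where each piece of text ends up. The variable names are mine, and the import falls back to the standard library's xml.etree.ElementTree, which exposes the same .text and .tail attributes, in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # same .text/.tail API in the standard library
    import xml.etree.ElementTree as etree

root = etree.fromstring("<html><body>Hello<br/>World</body></html>")
body = root.find("body")
br = body.find("br")

print(repr(body.text))  # text before the first child of <body>
print(repr(br.text))    # <br/> has no text of its own
print(repr(br.tail))    # text that follows <br/>, up to </body>
```

So "Hello" belongs to body.text and "World" to br.tail; neither shows up as a child of body.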

I can't say the following is pretty or exactly what you want, but it should at least point you in the right direction.

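from lxml import etree

# test.dat is assumed to hold the sample XML from the question
tree = etree.parse("test.dat").getroot()
main_p = tree[0]  # the first nested <p>
# Start with the text that precedes the first child element
elements = [main_p.text]
for child in main_p:
    elements.append(child.tag)
    # .tail is the text between this child's closing tag and the next element
    elements.append(child.tail)
    print(f"TAG: {child.tag} has tail: #{child.tail}#")

print(elements)

Output: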

TAG: ul has tail: #
    #
TAG: p has tail: #
    Text node 3
  #
['\n    Text node 1\n    ', 'ul', '\n    ', 'p', '\n    Text node 3\n  ']

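So "Text node 1" is the text of the main p. "Text node 3", on the other hand, sits inside the main p but is actually the tail of the inner p.

Beyond that, you can iterate over the main p element and, whenever a child is a p tag, move it out of the main p and add it to the root tag. Again, the following is only an example.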
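
To get something closer to the [Text, Ul, P, Text] list asked for, the same .text/.tail bookkeeping can be wrapped in a small helper. This is only a sketch: mixed_content is a hypothetical name, not part of lxml, and it assumes whitespace-only text can be dropped. The import falls back to the standard library's xml.etree.ElementTree, which has the same .text and .tail attributes.

```python
try:
    from lxml import etree
except ImportError:  # same .text/.tail API in the standard library
    import xml.etree.ElementTree as etree


def mixed_content(elem):
    """Return elem's text chunks and child elements in document order."""
    items = []
    if elem.text and elem.text.strip():
        items.append(elem.text.strip())
    for child in elem:
        items.append(child)
        # The tail is the text between this child and the next sibling
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


xml = """<p>
  Text node 1
  <ul><li>asdasd</li></ul>
  <p>Text node 2 <span>Som text</span></p>
  Text node 3
</p>"""

outer = etree.fromstring(xml)
# Show strings as they are and elements by tag name
shape = [item if isinstance(item, str) else item.tag
         for item in mixed_content(outer)]
print(shape)  # ['Text node 1', 'ul', 'p', 'Text node 3']
```

The list holds plain strings, not nodes, so it is handy for inspecting the order of the content but cannot be used to walk back up the tree from a piece of text.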

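from lxml import etree

tree = etree.parse("test.dat").getroot()
main_p = tree[0]
# Iterate over a reversed copy so that moving children is safe and
# each moved <p> lands directly after main_p in the original order
for child in main_p[::-1]:
    if child.tag == 'p':
        # Inserting an element that already has a parent moves it, tail included
        tree.insert(tree.index(main_p) + 1, child)
        # Wrap the trailing text in a <p> of its own
        new_p = etree.Element('p')
        new_p.text = child.tail
        tree.insert(tree.index(child) + 1, new_p)
        child.tail = "\n"

# Rename the root so the result is no longer a <p> inside a <p>
tree.tag = 'something_else'
print(etree.tostring(tree, pretty_print=True).decode('utf-8'))

Output: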

<something_else>
   <p>
      Text node 1
      <ul>
         <li>asdasd</li>
      </ul>
   </p>
   <p>
      Text node 2
      <span>Som text</span>
   </p>
   <p>Text node 3</p>
   <p>Text node 4</p>
</something_else>
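
If you want the same idea as a reusable function, here is a rough sketch. The name hoist_nested_p and the <root> wrapper are my own, not anything from lxml. It assumes only one level of nesting needs fixing and that whitespace-only tails can be thrown away. It removes each child explicitly before re-inserting it, so it also works with the standard library's xml.etree.ElementTree, which the import falls back to.

```python
try:
    from lxml import etree
except ImportError:  # same Element API for everything used here
    import xml.etree.ElementTree as etree


def hoist_nested_p(root):
    """Move <p> elements nested directly inside a top-level <p> up to root."""
    for outer in list(root):  # snapshot: root is modified inside the loop
        if outer.tag != "p":
            continue
        position = list(root).index(outer) + 1
        for child in list(outer):
            if child.tag != "p":
                continue
            tail = (child.tail or "").strip()
            outer.remove(child)  # the tail goes away with the element
            child.tail = None
            root.insert(position, child)
            position += 1
            if tail:
                # Give the orphaned trailing text a <p> of its own
                wrapper = etree.Element("p")
                wrapper.text = tail
                root.insert(position, wrapper)
                position += 1
    return root


doc = etree.fromstring(
    "<root><p>Text node 1<ul><li>asdasd</li></ul>"
    "<p>Text node 2 <span>Som text</span></p>Text node 3</p>"
    "<p>Text node 4</p></root>"
)
hoist_nested_p(doc)
print(etree.tostring(doc).decode("utf-8"))
```

A hoisted p that itself contains another p is left as it is, so deeper nesting would need the function to be run again or made recursive.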