Python: remove non-tag content from XML

Question

I want to remove everything that is not inside an XML tag (clean-up) and optionally collect it in a list. I have some XML like this:

<tag>some text</tag> unwanted text <tag>some text</tag>

Using Python (regular expressions), I want to get this:

('<tag>some text</tag>','<tag>some text</tag>')

I tried:

cleanup = re.findall(r"^<.>.*</.>$",  input)

But I think the entire input also matches the regular expression. How can I solve this?

Update 1:

I tried to load it with:

import xml.etree.ElementTree as ET
root = ET.fromstring(str(cleanup))

Answer 1

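I just want to expand on what has already been answered here, because I think the right approach is not to use regular expressions on XML-like content. You should use an XML parser. The unwanted content is called the tail of an element, and you can clean it up while walking the parsed tree. Here is one way: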

import xml.etree.ElementTree as ET

s = '''<root><tag>some text</tag> unwanted text <tag>some text</tag></root>'''

tree = ET.fromstring(s)

cleaned_tree = []

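for node in tree:
    node.tail = ''  # drop the stray text that follows this element's closing tag
    # encoding='unicode' makes tostring() return str instead of bytes on Python 3
    cleaned_tree.append(ET.tostring(node, encoding='unicode'))

print(cleaned_tree)
['<tag>some text</tag>', '<tag>some text</tag>']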
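The question also asks to optionally put the removed content in a list. A minimal sketch of that, assuming Python 3 and the same wrapped sample string; the names kept and unwanted are only illustrative:

```python
import xml.etree.ElementTree as ET

s = '<root><tag>some text</tag> unwanted text <tag>some text</tag></root>'
root = ET.fromstring(s)

kept = []      # serialized elements with their tails removed
unwanted = []  # stray text found around the elements

# Text that appears before the first child is stored on the parent as .text
if root.text and root.text.strip():
    unwanted.append(root.text.strip())

for node in root:
    # Text that follows an element's closing tag is stored as its .tail
    if node.tail and node.tail.strip():
        unwanted.append(node.tail.strip())
    node.tail = None
    kept.append(ET.tostring(node, encoding='unicode'))

print(kept)
print(unwanted)
```

Whitespace-only tails are skipped here; drop the strip() checks if whitespace between tags matters for your data.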

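Side note: if you look at str(cleanup), you will see that it lacks a wrapping tag like the root in my sample. A failing fromstring() may be a hint that something is wrong with your XML source.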
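To illustrate the point about the missing root: calling str() on a list gives the list's printed form rather than XML, and the bare fragment from the question has two top-level elements, so the parser rejects it. A rough sketch, assuming the input contains no XML declaration or DOCTYPE; the wrapper name root is an arbitrary choice:

```python
import xml.etree.ElementTree as ET

fragment = '<tag>some text</tag> unwanted text <tag>some text</tag>'

# Parsing the bare fragment fails because it is not a single-rooted document
try:
    ET.fromstring(fragment)
    parsed_bare = True
except ET.ParseError:
    parsed_bare = False

# Wrapping the fragment in one enclosing element makes it well-formed
root = ET.fromstring('<root>' + fragment + '</root>')
tags = [child.tag for child in root]

print(parsed_bare)
print(tags)
```

If the real source is supposed to be a complete XML document, it is probably better to fix the source than to wrap it like this.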

python

regex

xml