Extracting Raw XML via lxml etree
I'm trying to extract raw XML from an XML file.
So if my data is:
<xml>
... Lots of XML ...
<getThese>
<clonedKey>1</clonedKey>
<clonedKey>2</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>this is a sentence</randomStuff>
</getThese>
<getThese>
<clonedKey>6</clonedKey>
<clonedKey>8</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>more words</randomStuff>
</getThese>
... Lots of XML ...
</xml>
I can easily get the elements I want using etree:
from lxml import etree
search_me = etree.fromstring(xml_str)
search_me.findall('./getThese')
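But how do I get the actual raw XML content? Everything I've seen in the documentation is about getting elements, text, or attributes, not the raw XML.
The output I want is a list with two elements: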
["<getThese>
<clonedKey>1</clonedKey>
<clonedKey>2</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>this is a sentence</randomStuff>
</getThese>",
"<getThese>
<clonedKey>6</clonedKey>
<clonedKey>8</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>more words</randomStuff>
</getThese>"]
You should be able to use tostring() to serialize the XML.
Example...
from lxml import etree
xml = """
<xml>
<getThese>
<clonedKey>1</clonedKey>
<clonedKey>2</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>this is a sentence</randomStuff>
</getThese>
<getThese>
<clonedKey>6</clonedKey>
<clonedKey>8</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>more words</randomStuff>
</getThese>
</xml>
"""
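# remove_blank_text=True drops whitespace-only text between elements
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.fromstring(xml, parser=parser)
elems = []
# tostring() returns bytes, so decode each serialized element to str
for elem in tree.xpath("getThese"):
    elems.append(etree.tostring(elem).decode())
print(elems)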
Printed output...
['<getThese><clonedKey>1</clonedKey><clonedKey>2</clonedKey><clonedKey>3</clonedKey><randomStuff>this is a sentence</randomStuff></getThese>', '<getThese><clonedKey>6</clonedKey><clonedKey>8</clonedKey><clonedKey>3</clonedKey><randomStuff>more words</randomStuff></getThese>']