Python: Extract Nested Elements from XML Efficiently
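I am trying to parse a large number of XML files that contain many nested elements, in order to collect specific information for later use. Because there are so many files, I want to do this as efficiently as possible to reduce processing time. I can extract the required information with xpath as shown below, but it seems inefficient, especially having to run a second for loop with another xpath search to extract the result values. I read the post "Efficient way to iterate through xml elements" and the article "High-performance XML parsing in Python with lxml", but I don't understand how to apply them to my use case. Is there a more efficient way to get the desired output below? Can I collect the information I need with a single xpath query?
Desired parsed format: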
Id Object Type Result
Packages total totalPackages 1200
DeliveryMethod priority packagesSent 100
DeliveryMethod express packagesSent 200
DeliveryMethod ground packagesSent 300
DeliveryMethod priority packagesReceived 100
DeliveryMethod express packagesReceived 200
DeliveryMethod ground packagesReceived 300
XML sample:
<?xml version="1.0" encoding="utf-8"?>
<Data>
    <Location localDn="Chicago"/>
    <Info Id="Packages">
        <job jobId="1"/>
        <Type pos="1">totalPackages</Type>
        <Value Object="total">
            <result pos="1">1200</result>
        </Value>
    </Info>
    <Info Id="DeliveryMethod">
        <job jobId="1"/>
        <Type pos="1">packagesSent</Type>
        <Type pos="2">packagesReceived</Type>
        <Value Object="priority">
            <result pos="1">100</result>
            <result pos="2">100</result>
        </Value>
        <Value Object="express">
            <result pos="1">200</result>
            <result pos="2">200</result>
        </Value>
        <Value Object="ground">
            <result pos="1">300</result>
            <result pos="2">300</result>
        </Value>
    </Info>
</Data>
My approach:
from lxml import etree
xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()
for elem in tree.xpath('//*'):
    if elem.tag == 'Type':
        # For every Type, search the whole tree again for the matching results
        for value in tree.xpath(f'//*/Info[@Id="{elem.getparent().attrib["Id"]}"]/Value/result[@pos="{elem.attrib["pos"]}"]'):
            print(elem.getparent().attrib['Id'], value.getparent().attrib['Object'], elem.text, value.text)
Current output:
Packages total totalPackages 1200
DeliveryMethod priority packagesSent 100
DeliveryMethod express packagesSent 200
DeliveryMethod ground packagesSent 300
DeliveryMethod priority packagesReceived 100
DeliveryMethod express packagesReceived 200
DeliveryMethod ground packagesReceived 300
Is it possible to get all the information by iterating over tree.xpath('//*')?
Performance may be better if you don't iterate over all tags (//*) but only over the <Value> elements:
from lxml import etree
xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()
for val in tree.xpath('//Value'):
    # Map each pos to the Type text of the enclosing Info
    t = {t.get('pos'): t.text for t in val.getparent().xpath('./Type')}
    for r in val.xpath('./result'):
        print(val.getparent().get('Id'), val.get('Object'), t[r.get('pos')], r.text)
Prints:
Packages total totalPackages 1200
DeliveryMethod priority packagesSent 100
DeliveryMethod priority packagesReceived 100
DeliveryMethod express packagesSent 200
DeliveryMethod express packagesReceived 200
DeliveryMethod ground packagesSent 300
DeliveryMethod ground packagesReceived 300
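The snippet above rebuilds the pos-to-Type map for every Value. A minimal sketch of a variant that visits each Info once and builds the map once per Info is shown here. The function name extract_rows, the inline SAMPLE document, and the fallback to the standard-library ElementTree are assumptions added for illustration, not part of the original answer. It uses iter() and findall() instead of xpath(), which may be faster for simple child lookups, though that should be measured on real files.

```python
from io import BytesIO

try:
    from lxml import etree  # the library used throughout this thread
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree

# Shortened copy of the sample document from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<Data>
  <Info Id="Packages">
    <Type pos="1">totalPackages</Type>
    <Value Object="total"><result pos="1">1200</result></Value>
  </Info>
  <Info Id="DeliveryMethod">
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority"><result pos="1">100</result><result pos="2">100</result></Value>
    <Value Object="express"><result pos="1">200</result><result pos="2">200</result></Value>
  </Info>
</Data>"""


def extract_rows(tree):
    """Return (Id, Object, Type, Result) tuples, visiting each <Info> once."""
    rows = []
    for info in tree.iter('Info'):
        # Build the pos -> type-name map once per <Info>, not once per <Value>.
        types = {t.get('pos'): t.text for t in info.findall('Type')}
        for value in info.findall('Value'):
            for result in value.findall('result'):
                rows.append((info.get('Id'), value.get('Object'),
                             types[result.get('pos')], result.text))
    return rows


rows = extract_rows(etree.parse(BytesIO(SAMPLE)))
for row in rows:
    print(*row)
```

Returning tuples instead of printing keeps the rows available for later use, which is what the question asks for.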
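One optimization is to stop iterating over all tags with tree.xpath('//*') and checking each one with an if statement, as you do now. That can be replaced with tree.xpath('//Type').
The next thing to optimize is iterating over the values. Instead of searching all Value elements again and again (tree.xpath('//Value')), you can get only the Value elements that are siblings of the current Type tag with elem.xpath('./following-sibling::Value').
from lxml import etree
xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()
for elem in tree.xpath('//Type'):
    _id = elem.getparent().attrib["Id"]
    _type = elem.text
    _position = elem.attrib["pos"]
    # Only the Value elements that follow this Type inside the same Info
    values = elem.xpath('./following-sibling::Value')
    for value in values:
        _object = value.attrib['Object']
        # The result whose pos matches the pos of the current Type
        _result = value.xpath(f'./result[@pos={_position}]/text()')[0]
        print(_id, _type, _object, _result)
This prints: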
Packages totalPackages total 1200
DeliveryMethod packagesSent priority 100
DeliveryMethod packagesSent express 200
DeliveryMethod packagesSent ground 300
DeliveryMethod packagesReceived priority 100
DeliveryMethod packagesReceived express 200
DeliveryMethod packagesReceived ground 300
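Note that the [0] index above raises an IndexError if some Value has no result for the current pos. A minimal sketch of a more defensive variant follows, assuming a missing result should be reported as None. The function name rows_by_type, the inline SAMPLE (where the express Value deliberately lacks pos="2"), and the stdlib fallback are illustrative assumptions. It also looks up every Value child of the Info rather than using the following-sibling axis, which gives the same result as long as all Type elements precede the Value elements.

```python
try:
    from lxml import etree  # the library used throughout this thread
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree

# Hypothetical input: the "express" Value has no result with pos="2".
SAMPLE = b"""<Data>
  <Info Id="DeliveryMethod">
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority"><result pos="1">100</result><result pos="2">100</result></Value>
    <Value Object="express"><result pos="1">200</result></Value>
  </Info>
</Data>"""


def rows_by_type(root):
    """Return (Id, Type, Object, Result) tuples; Result is None when missing."""
    rows = []
    for info in root.iter('Info'):
        values = info.findall('Value')  # fetched once per <Info>
        for type_elem in info.findall('Type'):
            pos = type_elem.get('pos')
            for value in values:
                # find() returns None instead of raising when nothing matches.
                result = value.find(f"result[@pos='{pos}']")
                text = result.text if result is not None else None
                rows.append((info.get('Id'), type_elem.text,
                             value.get('Object'), text))
    return rows


rows = rows_by_type(etree.fromstring(SAMPLE))
for row in rows:
    print(*row)
```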
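Edit
This is a solution for a specific case, where we are sure that the number of result elements inside each Value tag equals the number of Type tags that are siblings of that Value tag. The solution also assumes that Type and result are ordered by the same pos attribute.
Keep in mind that this is a very specific solution, not a general one.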
from lxml import etree
xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()
for elem in tree.xpath('//Type'):
    _id = elem.getparent().attrib["Id"]
    _type = elem.text
    # 1-based position of this Type among its Type siblings
    _index = len(elem.xpath('./preceding-sibling::Type')) + 1
    _objects = elem.xpath('./following-sibling::Value/@Object')
    # Take the result at the same position in each Value; a bare Value/result
    # would return every result of every Value and zip() would mispair them
    _results = elem.xpath(f'./following-sibling::Value/result[{_index}]/text()')
    for _object, _result in zip(_objects, _results):
        print(_id, _type, _object, _result)
Output:
Packages totalPackages total 1200
DeliveryMethod packagesSent priority 100
DeliveryMethod packagesSent express 200
DeliveryMethod packagesSent ground 300
DeliveryMethod packagesReceived priority 100
DeliveryMethod packagesReceived express 200
DeliveryMethod packagesReceived ground 300
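Since the question mentions many files and an article on high-performance parsing, here is a rough sketch of a streaming variant built on iterparse. It handles each Info as soon as its closing tag is read and then clears it, so a whole file does not have to stay in memory. The generator name iter_rows, the inline SAMPLE, and the stdlib fallback are assumptions for illustration. Whether this is actually faster than the xpath versions depends on file size and should be measured.

```python
from io import BytesIO

try:
    from lxml import etree  # the library used throughout this thread
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree

# Shortened copy of the sample document from the question.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<Data>
  <Info Id="Packages">
    <Type pos="1">totalPackages</Type>
    <Value Object="total"><result pos="1">1200</result></Value>
  </Info>
  <Info Id="DeliveryMethod">
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="ground"><result pos="1">300</result><result pos="2">300</result></Value>
  </Info>
</Data>"""


def iter_rows(source):
    """Yield (Id, Object, Type, Result) tuples while the file is being parsed."""
    for _event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag != 'Info':
            continue  # children stay intact until their <Info> is complete
        types = {t.get('pos'): t.text for t in elem.findall('Type')}
        for value in elem.findall('Value'):
            for result in value.findall('result'):
                yield (elem.get('Id'), value.get('Object'),
                       types[result.get('pos')], result.text)
        elem.clear()  # drop the processed <Info> subtree to free memory


rows = list(iter_rows(BytesIO(SAMPLE)))
for row in rows:
    print(*row)
```

In real use, source would be a file path or an open binary file for each XML file rather than a BytesIO.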