Parsing a large XML file in a loop in Python

Question

I want to parse an XML file (~1 GB) with the following structure:

<Publication creationDateTime="04-AUG-2019 05:22:07">
  <holds>
    <hold>
      <recordType>Standard</recordType>
      <isEnroute>true</isEnroute>
      <holdName>NANLANG</holdName>
      <holdTime>10</holdTime>
      <inbound>
        <courseValue>170</courseValue>
      </inbound>
      <min>
        <altitude>7874</altitude>
      </min>
    </hold>
    <hold>
      <recordType>Standard</recordType>
      <holdName>ZILINA LOM</holdName>
      <holdTime>10</holdTime>
      <inbound>
        <courseValue>243</courseValue>
      </inbound>
      <max>
        <isFlightLevel>true</isFlightLevel>
        <altitude>85</altitude>
      </max>
      <min>
        <altitude>4500</altitude>
      </min>
    </hold>
  </holds>
</Publication>

I have already figured out that the most efficient way is to use the iterparse method of lxml.etree.

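I need to parse each tag into a variable and then insert the result into a database. The problem is that I don't understand how to iterate over the top-level record tag (e.g. hold) and insert one row per record. My sample code is below: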

class Avia:
    def __init__(self, **kwargs):
        for attr in kwargs.keys():
            self.__dict__[attr] = kwargs[attr]

context = ET.iterparse('test.xml')

def xml_fast_iter(context):
    for event, elem in context:
        if elem.tag == 'holdName':
            hold_name = elem.text
        elif elem.tag == 'holdTime':
            hold_time = elem.text
        elif elem.tag == 'courseValue':
            course = float(elem.text)
        elif elem.tag == 'isEnroute':
            hold_enr = elem.text
        # ...

        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            if ancestor.tag == 'min':
                bottom = alt
            if ancestor.tag == 'max':
                top = alt

            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]

        if elem.tag == 'hold':
            hold_type = 'TER'
            if hold_enr:
                hold_type = 'ENR'
            outbound = course + 180 if course + 180 < 360 else course - 180
            holdPattern = Avia(name=hold_name, time=hold_time, course=course, outbound=outbound, type=hold_type, bottom=bottom, top=top)
            prop_dict = holdPattern.__dict__
            print(prop_dict)
    del context

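When I print the result, I obviously get hold_type = 'ENR' for the second object: hold_enr was set by the first object and never changed, and the second object has no isEnroute key at all. When I instead assign None to all the variables right after for event, elem in context:, every value comes out as None except the last one, because the loop body runs once per element rather than once per hold.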

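What is the correct way to parse all the keys and initialize the objects? Maybe my approach is completely wrong?

Is it correct to reset the variables to None after the object has been initialized? (In that case hold_type comes out right.)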

Answer 1

Listen for the 'start' event as well, and initialize your variables at that point.
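To see why this works, it can help to look at the order in which the events arrive. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree on a small in-memory document (my assumption, so that it runs without extra packages; lxml.etree.iterparse takes the same events argument), and SAMPLE and event_trace are made-up names.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in document; the second <hold> has no <isEnroute>.
SAMPLE = b"""<holds>
  <hold><holdName>A</holdName><isEnroute>true</isEnroute></hold>
  <hold><holdName>B</holdName></hold>
</holds>"""

def event_trace(xml_bytes):
    """Return the (event, tag) pairs in the order iterparse reports them."""
    source = io.BytesIO(xml_bytes)
    return [(event, elem.tag)
            for event, elem in ET.iterparse(source, events=('start', 'end'))]

trace = event_trace(SAMPLE)
for event, tag in trace:
    print(event, tag)
```

Each 'start' of a hold arrives before any of that hold's children are reported, so it is a safe place to reset per-record state. Read elem.text only on 'end'; at 'start' the text is not guaranteed to have been parsed yet.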

It is convenient to collect the values in a dict that can be passed as-is to Avia(), and to write the parser as a generator function (i.e. yield hold).

from lxml import etree as ET

def xml_fast_iter(xmlfile):
    context = ET.iterparse(xmlfile, events=('start', 'end'))

    for event, elem in context:
        if event == 'start':
            if elem.tag == 'hold':
                # A new record begins: reset all per-hold state here.
                hold = {
                    'name': 'define',
                    'time': 'all',
                    'course': 'defaults',
                    'outbound': 'here',
                    'type': 'TER',
                    'bottom': '',
                    'top': '',
                }
                # Named max_alt/min_alt to avoid shadowing the built-ins.
                max_alt = {
                    'altitude': 0
                }
                min_alt = {
                    'altitude': 0
                }
            # Remember which block the next <altitude> belongs to.
            if elem.tag == 'max':
                minmax = max_alt
            if elem.tag == 'min':
                minmax = min_alt
        else:
            # 'end' event: the element's text is complete now.
            if elem.tag == 'holdName':
                hold['name'] = elem.text
            elif elem.tag == 'holdTime':
                hold['time'] = elem.text
            elif elem.tag == 'courseValue':
                course = float(elem.text)
                hold['course'] = course
                hold['outbound'] = course + 180 if course + 180 < 360 else course - 180
            elif elem.tag == 'isEnroute':
                # Only an explicit "true" marks an en-route hold.
                if elem.text == 'true':
                    hold['type'] = 'ENR'
            elif elem.tag == 'altitude':
                minmax['altitude'] = int(elem.text)
            elif elem.tag == 'hold':
                # The record is complete: copy the limits and hand it out.
                hold['bottom'] = min_alt['altitude']
                hold['top'] = max_alt['altitude']
                yield hold

Usage:

for hold in xml_fast_iter('test.xml'):
    holdPattern = Avia(**hold)
    prop_dict = holdPattern.__dict__
    print(prop_dict)
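Since the goal is to insert the records into a database, the dicts can also be fed straight into a batched insert. The following is only a sketch under assumptions: it uses sqlite3 from the standard library, and the table name hold_pattern, its columns, and the helper insert_holds are hypothetical. The sample rows stand in for what the generator yields.

```python
import sqlite3

def insert_holds(conn, holds, batch_size=1000):
    """Insert an iterable of hold dicts in batches; return the row count."""
    # Hypothetical schema; adjust names and types to the real database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hold_pattern ("
        "name TEXT, time TEXT, course REAL, outbound REAL, "
        "type TEXT, bottom INTEGER, top INTEGER)"
    )
    sql = ("INSERT INTO hold_pattern VALUES "
           "(:name, :time, :course, :outbound, :type, :bottom, :top)")
    total = 0
    batch = []
    for hold in holds:
        batch.append(hold)
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)  # named placeholders read dict keys
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        conn.executemany(sql, batch)
        total += len(batch)
    conn.commit()
    return total

# Stand-in rows with the same keys the generator yields.
sample = [
    {'name': 'NANLANG', 'time': '10', 'course': 170.0, 'outbound': 350.0,
     'type': 'ENR', 'bottom': 7874, 'top': 0},
    {'name': 'ZILINA LOM', 'time': '10', 'course': 243.0, 'outbound': 63.0,
     'type': 'TER', 'bottom': 4500, 'top': 85},
]
conn = sqlite3.connect(':memory:')
inserted = insert_holds(conn, sample, batch_size=1)
```

With the real file, the call would be insert_holds(conn, xml_fast_iter('test.xml')). Because the parser is a generator, only one batch of records is held in memory at a time.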

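Note that what I did with 'max' and 'min' is only a guess at your requirements, but it shows how to handle contextual data without resorting to an XPath expression such as ./ancestor-or-self::*, which would be much slower by comparison.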
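For a file of about 1 GB it is probably also worth releasing each hold once it has been yielded, as the question's code tried to do with elem.clear(). The sketch below is not the code above but a variation under assumptions: it uses the standard library's xml.etree.ElementTree so that it runs as-is, tracks the open tags in a list (a hypothetical alternative to the minmax variable), and keeps only the name and altitude fields. SAMPLE and iter_altitudes are made-up names.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<Publication>
  <holds>
    <hold>
      <holdName>NANLANG</holdName>
      <min><altitude>7874</altitude></min>
    </hold>
    <hold>
      <holdName>ZILINA LOM</holdName>
      <max><altitude>85</altitude></max>
      <min><altitude>4500</altitude></min>
    </hold>
  </holds>
</Publication>"""

def iter_altitudes(source):
    """Yield one dict per <hold>, clearing each element once it is consumed."""
    path = []      # tags of the currently open elements, outermost first
    record = None
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            path.append(elem.tag)
            if elem.tag == 'hold':
                record = {'name': None, 'min': None, 'max': None}
        else:
            path.pop()  # path now ends with the parent's tag
            if elem.tag == 'holdName':
                record['name'] = elem.text
            elif elem.tag == 'altitude' and path and path[-1] in ('min', 'max'):
                record[path[-1]] = int(elem.text)
            elif elem.tag == 'hold':
                yield record
                elem.clear()  # drop the children we no longer need

records = list(iter_altitudes(io.BytesIO(SAMPLE)))
```

elem.clear() leaves an empty hold element attached to its parent, which is small but not free. With lxml, the question's own loop over getprevious() and getparent() can additionally delete the already-processed siblings.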

Tags: lxml, for-loop, xml-parsing, python-3.x