How do I create an XML document using pre-existing etree elements with lxml and Python?

I am working with a large (≈ 50 MB) XML file that contains word-definition entries sorted alphanumerically by headword, in the following format:

<xml>

    <p>
        <word>Word1</word>
        <pos>word #1 part of speech</pos>
        <def>Definition for word #1</def>
    </p>

    <p>
        <word>Word2</word>
        <pos>word #2 part of speech</pos>
        <def>Definition for word #2</def>
    </p>

    <p>
        <word>Word3</word>
        <pos>word #3 part of speech</pos>
        <def>Definition for word #3</def>
    </p>
    .....
    <p>
        <word>Word3812089</word>
        <pos>word #3812089 part of speech</pos>
        <def>Definition for word #3812089</def>
    </p>

</xml>

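Assuming that all words starting with the same letter are adjacent to each other, how can I split this file by first letter into 26 separate XML files? For example, if I have a file like this: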

<words>

    <p>
        <word>Bar</word>
        <pos>n. </pos>
        <def>A straight piece of something</def>
    </p>

    <p>
        <word>Bear</word>
        <pos>n.</pos>
        <def>A large furry predator.</def>
    </p>

    <p>
        <word>Cat</word>
        <pos>n.</pos>
        <def>A small domesticated furry mammal</def>
    </p>

    <p>
        <word>Dim</word>
        <pos>adj.</pos>
        <def>Lacking in illumination.</def>
    </p>

</words>

How can I turn it into these:

<words_b>
    <p>
        <word>Bar</word>
        <pos>n. </pos>
        <def>A straight piece of something</def>
    </p>

    <p>
        <word>Bear</word>
        <pos>n.</pos>
        <def>A large furry predator.</def>
    </p>
</words_b>
<words_c>
    <p>
        <word>Cat</word>
        <pos>n.</pos>
        <def>A small domesticated furry mammal</def>
    </p>
</words_c>
<words_d>
    <p>
        <word>Dim</word>
        <pos>adj.</pos>
        <def>Lacking in illumination.</def>
    </p>
</words_d>

This looks like a grouping problem, which you can solve with itertools.groupby:

from lxml import etree as ET

import itertools as IT

xml = '''<words>
    <p>
        <word>Bar</word>
        <pos>n. </pos>
        <def>A straight piece of something</def>
    </p>
    <p>
        <word>Bear</word>
        <pos>n.</pos>
        <def>A large furry predator.</def>
    </p>
    <p>
        <word>Cat</word>
        <pos>n.</pos>
        <def>A small domesticated furry mammal</def>
    </p>
    <p>
        <word>Dim</word>
        <pos>adj.</pos>
        <def>Lacking in illumination.</def>
    </p>
</words>'''

words = ET.fromstring(xml)

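# Group adjacent <p> entries by the lowercased first letter of their <word> text,
# so that the root tags match the requested words_b, words_c, ... names.
for key, group in IT.groupby(words, lambda w: w[0].text[0].lower()):
    group_element = ET.Element('words_' + key)
    for item in group:
        # In lxml, append() moves the element out of its old parent.
        group_element.append(item)
    ET.dump(group_element, pretty_print=True)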

Of course, you can write each group to a file instead of dumping group_element.
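
Writing the groups to files could look roughly like the sketch below. The names split_by_letter, first_letter, out_dir and the words_<letter>.xml naming scheme are hypothetical, chosen only for illustration. It assumes every <p> has a non-empty <word> child and that same-letter entries are adjacent, as the question states; otherwise a later run would overwrite an earlier file for the same letter. The fallback import is also an assumption, so that the sketch still runs where lxml is not installed.

```python
import itertools
import os

try:
    from lxml import etree as ET  # the library used in the answer
except ImportError:
    # Assumption: the standard library offers the API subset used below.
    import xml.etree.ElementTree as ET


def first_letter(entry):
    """Grouping key: lowercased first character of the entry's <word> text."""
    return entry.findtext('word').strip()[0].lower()


def split_by_letter(root, out_dir):
    """Write one words_<letter>.xml file per run of adjacent entries.

    Returns the list of file paths that were written, in document order.
    """
    paths = []
    # list(root) takes a snapshot of the children, so re-parenting them
    # inside the loop cannot disturb the iteration.
    for key, group in itertools.groupby(list(root), key=first_letter):
        group_element = ET.Element('words_' + key)
        for item in group:
            group_element.append(item)
        path = os.path.join(out_dir, 'words_%s.xml' % key)
        ET.ElementTree(group_element).write(
            path, encoding='utf-8', xml_declaration=True)
        paths.append(path)
    return paths
```

For the real 50 MB file, you would call something like split_by_letter(ET.parse('dictionary.xml').getroot(), 'out'), where both the file name and the directory are placeholders and the directory must already exist. This loads the whole document into memory, which is usually acceptable at that size.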