How do I create an XML document using pre-existing etree elements with lxml and Python?
I'm working with a large (≈ 50 MB) XML file containing word-definition entries, sorted alphanumerically by headword, in the following format:
<xml>
<p>
<word>Word1</word>
<pos>word #1 part of speech</pos>
<def>Definition for word #1</def>
</p>
<p>
<word>Word2</word>
<pos>word #2 part of speech</pos>
<def>Definition for word #2</def>
</p>
<p>
<word>Word3</word>
<pos>word #3 part of speech</pos>
<def>Definition for word #3</def>
</p>
.....
<p>
<word>Word3812089</word>
<pos>word #3812089 part of speech</pos>
<def>Definition for word #3812089</def>
</p>
</xml>
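Assuming all words that start with the same letter are adjacent to one another, how can I split this file into 26 separate XML files, one per first letter?
For example, if I have a file like this: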
<words>
<p>
<word>Bar</word>
<pos>n. </pos>
<def>A straight piece of something</def>
</p>
<p>
<word>Bear</word>
<pos>n.</pos>
<def>A large furry predator.</def>
</p>
<p>
<word>Cat</word>
<pos>n.</pos>
<def>A small domesticated furry mammal</def>
</p>
<p>
<word>Dim</word>
<pos>adj.</pos>
<def>Lacking in illumination.</def>
</p>
</words>
how can I turn it into these:
<words_b>
<p>
<word>Bar</word>
<pos>n. </pos>
<def>A straight piece of something</def>
</p>
<p>
<word>Bear</word>
<pos>n.</pos>
<def>A large furry predator.</def>
</p>
</words_b>
<words_c>
<p>
<word>Cat</word>
<pos>n.</pos>
<def>A small domesticated furry mammal</def>
</p>
</words_c>
<words_d>
<p>
<word>Dim</word>
<pos>adj.</pos>
<def>Lacking in illumination.</def>
</p>
</words_d>
This looks like a grouping problem, which you can solve with itertools.groupby:
from lxml import etree as ET
import itertools as IT
xml = '''<words>
<p>
<word>Bar</word>
<pos>n. </pos>
<def>A straight piece of something</def>
</p>
<p>
<word>Bear</word>
<pos>n.</pos>
<def>A large furry predator.</def>
</p>
<p>
<word>Cat</word>
<pos>n.</pos>
<def>A small domesticated furry mammal</def>
</p>
<p>
<word>Dim</word>
<pos>adj.</pos>
<def>Lacking in illumination.</def>
</p>
</words>'''
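words = ET.fromstring(xml)
# Group adjacent <p> entries by the first letter of their <word> child.
# .lower() makes the tag names match the requested words_b, words_c, ...
for key, group in IT.groupby(words, lambda w: w[0].text[0].lower()):
    group_element = ET.Element('words_' + key)
    for item in group:
        # In lxml, append() moves the element out of the original tree.
        group_element.append(item)
    ET.dump(group_element, pretty_print=True)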
Of course, you could write each group_element to a file instead of dumping it.
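As a rough sketch of that last step, the loop above could be wrapped in a function that parses the source file and writes each group to its own file. The function name split_by_initial, the words_<letter>.xml file naming, and the lowercase keys are my own assumptions for illustration, not part of the answer above. The fallback import is also an assumption, added only so the sketch still runs where lxml is not installed.

```python
import itertools
import os

try:
    from lxml import etree as ET  # the library used in the answer
except ImportError:
    # Assumption: the stdlib API is compatible for the calls used below.
    import xml.etree.ElementTree as ET


def split_by_initial(source_path, out_dir):
    """Write one words_<letter>.xml per initial letter; return the paths written."""
    root = ET.parse(source_path).getroot()
    written = []
    # groupby only merges adjacent entries, so same-letter words must be contiguous.
    first_letter = lambda p: p.findtext("word")[0].lower()
    for key, group in itertools.groupby(root, first_letter):
        group_element = ET.Element("words_" + key)
        for entry in group:
            group_element.append(entry)
        out_path = os.path.join(out_dir, "words_" + key + ".xml")
        # Wrap the element in an ElementTree so it can be serialized to disk.
        ET.ElementTree(group_element).write(
            out_path, encoding="utf-8", xml_declaration=True
        )
        written.append(out_path)
    return written
```

Calling split_by_initial("dictionary.xml", "out") (both names hypothetical, and the out directory must already exist) would produce out/words_b.xml, out/words_c.xml, and so on. If same-letter words were not adjacent, a later group would overwrite the earlier file for that letter, and a headword starting with a character that is not valid in an XML name would need separate handling.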