Not parsing nested XML tags correctly in Python
I'm working with XML files in Python. I have a dataset that contains sentences in multiple languages, structured as follows:
<corpus>
    <sentence id="0">
        <text lang="de">...</text>
        <text lang="en">...</text>
        <text lang="fr">...</text>
        <!-- Other languages -->
        <annotations>
            <annotation lang="de">...</annotation>
            <annotation lang="en">...</annotation>
            <annotation lang="fr">...</annotation>
            <!-- Other languages -->
        </annotations>
    </sentence>
    <sentence id="1">
        <!-- Other sentence -->
    </sentence>
    <!-- Other sentences -->
</corpus>
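Starting from this dataset, I want to produce a new dataset that contains only the English sentences and annotations (those whose lang attribute has the value "en"). I tried this solution: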
import xml.etree.ElementTree as ET
tree = ET.parse('samplefile2.xml')
root = tree.getroot()
for sentence in root:
    if sentence.tag == 'sentence':
        for txt in sentence:
            if txt.tag == 'text':
                if txt.attrib['lang'] != 'en':
                    sentence.remove(txt)
            if txt.tag == 'annotations':
                for annotation in txt:
                    if annotation.attrib['lang'] != 'en':
                        txt.remove(annotation)
tree.write('output.xml')
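But it seems to work only at the text level, not at the annotation level. I even tried replacing sentence, txt, annotation on the Python side with incremental indices root[s], root[s][t], root[s][t][a], but that had no effect. In addition, the Python code I posted randomly inserts strings like δημιουργία into the XML file (honestly, I don't know whether this helps in solving the problem).
So I strongly believe the problem lies in the nested tags, but I can't figure it out. Any ideas?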
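One likely cause, offered as a sketch rather than a confirmed diagnosis: the loop calls remove() on the very element it is iterating over. Removing a child shifts the remaining children left, so the iterator skips the next sibling, and with the sample structure it appears to run out of children before it ever reaches annotations. The snippet below rebuilds one sentence inline (the inline string and the visited list exist only for illustration) and records which children the loop actually sees.

```python
import xml.etree.ElementTree as ET

# One sentence from the sample, built inline so the snippet is self-contained.
sentence = ET.fromstring(
    '<sentence id="0">'
    '<text lang="de">...</text>'
    '<text lang="en">...</text>'
    '<text lang="fr">...</text>'
    '<annotations>'
    '<annotation lang="de">...</annotation>'
    '<annotation lang="en">...</annotation>'
    '<annotation lang="fr">...</annotation>'
    '</annotations>'
    '</sentence>'
)

visited = []  # (tag, lang) of every child the loop actually reaches
for child in sentence:
    visited.append((child.tag, child.get("lang")))
    if child.tag == "text" and child.get("lang") != "en":
        # Mutates the sequence that the for loop is currently walking.
        sentence.remove(child)

print(visited)
print([c.tag for c in sentence])
```

If the annotations element is missing from the visited list, that would explain why the inner loop never runs. Iterating over a copy, for example list(sentence), should avoid the skipping.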
If you're able to use lxml, I think this would be easier with XPath...
XML input (input.xml)
<corpus>
    <sentence id="0">
        <text lang="de">...</text>
        <text lang="en">...</text>
        <text lang="fr">...</text>
        <!-- Other languages -->
        <annotations>
            <annotation lang="de">...</annotation>
            <annotation lang="en">...</annotation>
            <annotation lang="fr">...</annotation>
            <!-- Other languages -->
        </annotations>
    </sentence>
    <sentence id="1">
        <!-- Other sentence -->
    </sentence>
    <!-- Other sentences -->
</corpus>
Python
from lxml import etree
target_lang = "en"
tree = etree.parse("input.xml")
# Match any element that has a child that has a lang attribute with a value other than
# target_lang. We need this element so we can remove the child from it.
for parent in tree.xpath(f".//*[*[@lang != '{target_lang}']]"):
    # Match the children that have a lang attribute with a value other than target_lang.
    for child in parent.xpath(f"*[@lang != '{target_lang}']"):
        # Remove the child from the parent.
        parent.remove(child)
tree.write("output.xml")
XML output (output.xml)
<corpus>
    <sentence id="0">
        <text lang="en">...</text>
        <!-- Other languages -->
        <annotations>
            <annotation lang="en">...</annotation>
            <!-- Other languages -->
        </annotations>
    </sentence>
    <sentence id="1">
        <!-- Other sentence -->
    </sentence>
    <!-- Other sentences -->
</corpus>
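For anyone who cannot install lxml, the same filtering can probably be done with the standard library alone. The following is a minimal sketch and not part of the lxml approach above: keep_language and SAMPLE are hypothetical names, and SAMPLE is an abridged inline copy of the input. It walks snapshots made with list(...) so that removing a child does not disturb the loop.

```python
import xml.etree.ElementTree as ET

# Abridged inline copy of the input, so the sketch runs without a file.
SAMPLE = (
    '<corpus>'
    '<sentence id="0">'
    '<text lang="de">...</text>'
    '<text lang="en">...</text>'
    '<text lang="fr">...</text>'
    '<annotations>'
    '<annotation lang="de">...</annotation>'
    '<annotation lang="en">...</annotation>'
    '<annotation lang="fr">...</annotation>'
    '</annotations>'
    '</sentence>'
    '</corpus>'
)


def keep_language(root, target_lang="en"):
    """Remove every element whose lang attribute differs from target_lang."""
    # Take a snapshot of all elements first, so the tree is not walked
    # while it is being modified.
    for parent in list(root.iter()):
        # list(parent) is a copy of the children; remove() cannot skip siblings.
        for child in list(parent):
            lang = child.get("lang")
            if lang is not None and lang != target_lang:
                parent.remove(child)
    return root


root = keep_language(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

With a file, the equivalent would be tree = ET.parse("input.xml"), then keep_language(tree.getroot()), then tree.write("output.xml", encoding="utf-8"). Two caveats: the default xml.etree.ElementTree parser discards comments, so the comment lines would not survive, and tree.write defaults to us-ascii, which writes non-ASCII characters as numeric character references unless an encoding such as utf-8 is passed. Whether that is related to the stray strings mentioned in the question is only a guess.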