How to find and remove elements in an XML file (with namespaces) by condition with Python
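I have an XML file from which I want to remove elements based on a condition. However, for reasons that are not clear to me, the file's namespaces prevent the approaches described in other answers from working.
My XML looks like this: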
<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
<TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
<TextEquiv>
<Unicode />
</TextEquiv>
</TextRegion>
</Page>
</PcGts>
My goal is to remove every TextLine node whose "Unicode" tag contains no text. So the output would be:
<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
</TextRegion>
</Page>
</PcGts>
I tried some of the suggestions from those answers.
However:
import lxml.etree as ET
data = ET.parse(file)
root = data.getroot()
for x in root.xpath("//Unicode"):
    print(x.text)
This does not find any tags.
Another attempt:
for x in root.xpath("//{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Unicode"):
    print(x.text)
This raises "XPathEvalError: Invalid expression".
So what is the simplest way to remove all nodes with an empty Unicode tag from this XML file (and how do I find them in the first place)?
Thanks.
First, your XML is missing the closing tag for <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">,
but once you insert it in the proper place, the following should work:
import lxml.etree as ET

my_xml = """[your xml above, corrected]"""
data = ET.XML(my_xml.encode('ascii'))
# Match Unicode elements in any namespace that have no text node.
for target in data.xpath("//*[local-name() = 'Unicode'][not(text())]"):
    target.getparent().remove(target)
print(ET.tostring(data, xml_declaration=True))
Output:
<?xml version='1.0' encoding='ASCII'?>
<PcGts
xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
<TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
<TextEquiv/>
</TextLine>
</TextRegion>
</Page>
</PcGts>
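Note that the output above still contains the second TextLine, now holding an empty TextEquiv, whereas the question asked for the whole TextLine to go. A minimal sketch of one way to select and remove the TextLine directly, assuming the prefix `pc` (an arbitrary name chosen here, not something defined in the file) is bound to the PAGE namespace and that the sample XML has been corrected as described. Unlike `not(text())`, the `normalize-space()` test used here also treats whitespace-only text as empty, which may or may not be what you want:

```python
import lxml.etree as ET

# "pc" is an arbitrary prefix picked for this sketch; only the URI must match the file.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

my_xml = """<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="1.png">
    <TextRegion custom="a">
      <TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
        <TextEquiv><Unicode> abc </Unicode></TextEquiv>
      </TextLine>
      <TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
        <TextEquiv><Unicode /></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.XML(my_xml.encode("UTF-8"))

# Select each TextLine whose own TextEquiv/Unicode is empty or whitespace-only.
empty_lines = root.xpath(
    "//pc:TextLine[pc:TextEquiv/pc:Unicode[not(normalize-space())]]",
    namespaces=NS,
)
for line in empty_lines:
    line.getparent().remove(line)

remaining_ids = [tl.get("id") for tl in root.xpath("//pc:TextLine", namespaces=NS)]
print(remaining_ids)
```

This also explains the two failures in the question: the document declares a default namespace, so the unprefixed `//Unicode` matches nothing, and the `{uri}Unicode` form is accepted by `find()`, `findall()` and `iter()` but is not valid syntax for `xpath()`.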
Well, I finally found a solution to my problem.
import lxml.etree as ET

my_xml = """...xml content..."""
data = ET.XML(my_xml.encode('UTF-8'))

# This loop removes the empty "<Unicode />" elements.
for target in data.xpath("//*[local-name() = 'Unicode'][not(text())]"):
    target.getparent().remove(target)

# This loop removes elements left without children or text, such as the
# "<TextEquiv>" that contained the removed "<Unicode />".
# list() takes a snapshot so the tree is not modified while it is being iterated.
for el in list(data.iter()):
    if len(el) or ''.join(t.strip() for t in el.itertext()):
        continue
    parent = el.getparent()
    if parent is not None:
        parent.remove(el)

# Running the same loop again removes the now-empty "<TextLine>" element.
for el in list(data.iter()):
    if len(el) or ''.join(t.strip() for t in el.itertext()):
        continue
    parent = el.getparent()
    if parent is not None:
        parent.remove(el)

print(ET.tostring(data, xml_declaration=True))
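The two identical loops above work because the sample nests exactly two levels above Unicode. A hedged sketch of the same idea that repeats until nothing more can be removed, so the nesting depth does not matter; `prune_empty` is a hypothetical helper name introduced here, not an lxml function. Like the loops above, it removes any element with no children and no text even if it carries attributes, so it may be too aggressive for files where attribute-only elements hold data:

```python
import lxml.etree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def prune_empty(root):
    """Repeatedly remove elements with no children and no non-whitespace text.

    Returns the number of elements removed. The root itself is never removed.
    """
    removed = 0
    while True:
        # Collect candidates first so the tree is not modified while iterating.
        empty = [
            el for el in root.iter("*")
            if len(el) == 0
            and not (el.text or "").strip()
            and el.getparent() is not None
        ]
        if not empty:
            return removed
        for el in empty:
            el.getparent().remove(el)
        removed += len(empty)


sample = b"""<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
<Page imageFilename="1.png"><TextRegion custom="a">
<TextLine id="Ar0010001l1"><TextEquiv><Unicode> abc </Unicode></TextEquiv></TextLine>
<TextLine id="Ad0010100l2"><TextEquiv><Unicode /></TextEquiv></TextLine>
</TextRegion></Page></PcGts>"""

root = ET.XML(sample)
removed = prune_empty(root)
# Clark notation ({uri}tag) is accepted by iter(), unlike xpath().
line_ids = [el.get("id") for el in root.iter("{%s}TextLine" % PAGE_NS)]
print(removed, line_ids)
```

On this sample the helper removes the empty Unicode, then its TextEquiv, then its TextLine, one level per pass, without needing the separate XPath step.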
The idea came from another answer.