How to split an XML file (into several other files) properly once a certain tag has been found?
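Question:
I am trying to split an XML file by rewriting it once a certain tag is found. However, the result is not correct: when I iterate over the elements and add them to a new ET, their children are not copied along with them. The children do get added later, once iter reaches them, so even if I found a way to copy the children when adding the parent to the new ET, they would end up as duplicates.
What I have tried:
I parse the XML with lxml's ElementTree and then iterate over the elements.
If an element's tag does not match, the element is recorded in an ET object, which is then written out with tostring. Once the iterated element matches the tag I want the XML to be split on, the file name changes, which effectively 'splits' the output by writing to a new file.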
from lxml import etree as ET

parser = ET.XMLParser()
context = ET.parse('activity-list(2).xml', parser=parser)
index = 0
root = context.getroot()
new_data = ET.Element('iati-activity')

for elem in context.iter('iati-activity'):
    for element in list(elem.iter()):
        if element.tag == 'iati-identifier':
            print("PASSED HERE")
            index = index + 1
            filename = format(str(index) + ".xml")
        print("ELEMENT IS", element.tag)
        new_sub = ET.SubElement(new_data, element.tag, attrib=element.attrib)
        new_sub.text = element.text
        with open(filename, 'wb') as f:
            f.write(ET.tostring(new_data))
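The flattening described above can be reproduced on a tiny document. This is a minimal sketch, not the asker's code: the one-element input is invented for illustration. It shows that iter() visits every descendant in document order and that SubElement attaches each one directly under the new root, while copy.deepcopy keeps the nesting.

```python
import copy
from lxml import etree as ET

# Hypothetical miniature input, invented for illustration only.
src = ET.fromstring("<a><title><narrative>x</narrative></title></a>")

# Same pattern as in the question: every descendant becomes a direct child.
flat = ET.Element("a")
for element in src.iter():
    if element is src:
        continue  # iter() also yields the starting element itself
    new_sub = ET.SubElement(flat, element.tag, attrib=dict(element.attrib))
    new_sub.text = element.text

print(ET.tostring(flat))    # b'<a><title/><narrative>x</narrative></a>'

# deepcopy copies an element together with its whole subtree.
nested = copy.deepcopy(src[0])
print(ET.tostring(nested))  # b'<title><narrative>x</narrative></title>'
```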
Edit --
XML structure (input):
<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>
      <title>
        <narrative>
        </narrative>
      </title>
    </iati-identifier>
    <iati-identifier>
      <title>
        <narrative>
        </narrative>
      </title>
    </iati-identifier>
  </iati-activity>
</iati-activities>
XML structure (output - current):
<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>
      <title>
      </title>
      <narrative>
      </narrative>
    </iati-identifier>
  </iati-activity>
</iati-activities>
... Same structure is created in second file with next iati-identifier's data
Current input:
<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
    <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
    <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
    <narrative>Protracted relief and recovery operation</narrative>
    <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
  </title>
  <description>
    <narrative>Protracted relief and recovery operation</narrative>
  </description>
  <description>
    <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
  </description>
  <participating-org ref="XM-DAC-6-4" type="10" role="1">
    <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
  </participating-org>
  <other-identifier ref="011077" type="A1">
    <owner-org ref="XM-DAC-6-4">
      <narrative>AICS</narrative>
    </owner-org>
  </other-identifier>
  <activity-status code="2"/>
  <activity-date iso-date="2017-05-01" type="1"/>
  <activity-date iso-date="2018-04-30" type="3"/>
  <contact-info type="1">
    <organisation>
      <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
    </organisation>
    <telephone>+ 39 06 32492 305</telephone>
    <email>info@aics.gov.it</email>
    <mailing-address>
      <narrative>via Salvatore Contarini 25, 00135 Roma</narrative>
    </mailing-address>
  </contact-info>
  <recipient-country code="SO" percentage="100.00"/>
  <location>
    <location-reach code="1"/>
    <location-id/>
    <point/>
  </location>
  <collaboration-type code="3"/>
  <related-activity ref="XM-DAC-6-4-011077-01-0" type="2"/>
  <iati-identifier>XM-DAC-6-4-011077-01-0</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
    <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
    <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
    <narrative>Protracted relief and recovery operation</narrative>
    <narrative xml:lang="it">Protracted relief and recovery operation</narrative>
  </title>
  <description>
    <narrative>The scope of the program is to support the population on food security and resilience. In particular, to support local agricultural products and vulnerable families on food security.</narrative>
  </description>
  <description>
    <narrative xml:lang="it">Contributo al PAM per il programma per la sicurezza alimentare e la resilienza. Le attività, che con programmi analoghi sono state realizzate già negli scorsi anni includono oltre al tradizionale aiuto alimentare, anche il sostegno alle attività generatrici di reddito, la realizzazione di infrastrutture, il sostegno ai produttori agricoli locali e il sostegno alle famiglie più vulnerabili, per l’acquisto di beni alimentari e non, nel mercato locale attraverso smartcard prepagate che includono anche i dati biometrici dei beneficiari</narrative>
  </description>
  <participating-org ref="XM-DAC-6-4" type="10" role="1">
    <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
  </participating-org>
  <participating-org ref="41140" type="40" role="4">
    <narrative>WFP - WORLD FOOD PROGRAMME</narrative>
  </participating-org>
  <other-identifier ref="011077/01/0" type="A1">
    <owner-org ref="XM-DAC-6-4">
      <narrative>AICS</narrative>
    </owner-org>
  </other-identifier>
  <activity-status code="2"/>
  <activity-date iso-date="2017-05-02" type="1"/>
  <activity-date iso-date="2018-04-30" type="3"/>
  <contact-info type="1">
    <organisation>
      <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
    </organisation>
    <telephone>+ 39 06 32492 305</telephone>
    <email>info@aics.gov.it</email>
    <mailing-address>
      <narrative>via Salvatore Contarini 25, 00135 Roma</narrative>
    </mailing-address>
  </contact-info>
  <recipient-country code="SO" percentage="100.00"/>
  <sector code="52010" vocabulary="1" percentage="100.00"/>
  <policy-marker vocabulary="1" code="1" significance="0">
    <narrative>Gender Equality</narrative>
  </policy-marker>
  <policy-marker vocabulary="1" code="2" significance="0">
    <narrative>Aid to Environment</narrative>
  </policy-marker>
  <policy-marker vocabulary="1" code="3" significance="2">
    <narrative>Participatory Development/Good Governance</narrative>
  </policy-marker>
  <policy-marker vocabulary="1" code="4" significance="0">
    <narrative>Trade Development</narrative>
  </policy-marker>
  <policy-marker vocabulary="1" code="5" significance="0">
    <narrative>Aid Targeting the Objectives of the Convention on Biological Diversity</narrative>
  </policy-marker>
  <policy-marker vocabulary="1" code="6" significance="0">
    <narrative>Aid Targeting the Objectives of the Framework Convention on Climate Change - Mitigation</narrative>
  </policy-marker>
  <policy-marker vocabulary="1" code="7" significance="0">
    <narrative>Aid Targeting the Objectives of the Framework Convention on Climate Change - Adaptation</narrative>
  </policy-marker>
  <policy-marker vocabulary="1" code="8" significance="0">
    <narrative>Aid Targeting the Objectives of the Convention to Combat Desertification</narrative>
  </policy-marker>
  <collaboration-type code="3"/>
  <default-flow-type code="10"/>
  <default-finance-type code="110"/>
  <related-activity ref="XM-DAC-6-4-011077" type="1"/>
</iati-activity>
Expected output:
<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
    <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
    <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
    <narrative>Protracted relief and recovery operation</narrative>
    <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
  </title>
  <description>
    <narrative>Protracted relief and recovery operation</narrative>
  </description>
</iati-activity>
... next XML starts with next <iati-identifier>
Current output:
<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
  </reporting-org>
  <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
  <narrative>Italian Agency for Development Cooperation</narrative>
  <title>
  </title>
  <narrative>Protracted relief and recovery operation</narrative>
  <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
  <description>
  </description>
  <narrative>Protracted relief and recovery operation</narrative>
</iati-activity>
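Consider a parameterized XSLT to split the large input source by <iati-identifier> nodes into separate XML files. Python's lxml can run XSLT 1.0 scripts and can even pass parameter values from the application layer into the stylesheet (not unlike passing parameters in the other declarative, special-purpose language: SQL).
Specifically, Python can first run an XPath expression (XPath being a sibling of XSLT) to get the total count of iati-identifier nodes in the document, and then iteratively pass in the position of each one. In the stylesheet, following-sibling::node_name[1] is used to grab the first adjacent node by name.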
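The two mechanisms mentioned above can be tried in isolation before wiring them into a full stylesheet. The following is a minimal sketch under stated assumptions: the miniature input document and the one-template stylesheet are invented for illustration. Only ET.XSLT, ET.XSLT.strparam and xpath() are lxml calls taken from the approach described here.

```python
from lxml import etree as ET

# Hypothetical miniature input, invented for illustration only.
doc = ET.fromstring(
    "<iati-activity>"
    "<iati-identifier>A-1</iati-identifier>"
    "<title><narrative>First</narrative></title>"
    "<iati-identifier>A-2</iati-identifier>"
    "<title><narrative>Second</narrative></title>"
    "</iati-activity>"
)

# following-sibling::title[1] selects the nearest <title> after each identifier.
first_titles = []
for ident in doc.xpath("iati-identifier"):
    title = ident.xpath("following-sibling::title[1]")[0]
    first_titles.append((ident.text, title.findtext("narrative")))
print(first_titles)  # [('A-1', 'First'), ('A-2', 'Second')]

# A tiny stylesheet that selects one identifier by the position passed in.
xslt = ET.XSLT(ET.fromstring(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:output method="text"/>'
    '<xsl:param name="item_num"/>'
    '<xsl:template match="/iati-activity">'
    '<xsl:value-of select="iati-identifier[position()=$item_num]"/>'
    '</xsl:template>'
    '</xsl:stylesheet>'
))

# strparam quotes the value so it reaches the stylesheet as a string;
# position()=$item_num then compares it numerically.
result = xslt(doc, item_num=ET.XSLT.strparam("2"))
selected = str(result).strip()
print(selected)
```

Note that the comparison is written as position()=$item_num rather than as a bare [$item_num] predicate: a non-empty string used directly as a predicate would be treated as true for every node.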
XSLT (save as an .xsl file, which is a special kind of .xml file)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <!-- XSL PARAM -->
  <xsl:param name="item_num"/>

  <xsl:template match="/iati-activity">
    <xsl:apply-templates select="iati-identifier[position()=$item_num]"/>
  </xsl:template>

  <xsl:template match="iati-identifier">
    <iati-activity>
      <xsl:copy-of select="."/>
      <xsl:copy-of select="following-sibling::reporting-org[1]"/>
      <xsl:copy-of select="following-sibling::narrative[1]"/>
      <xsl:copy-of select="following-sibling::title[1]"/>
      <xsl:copy-of select="following-sibling::description[1]"/>
    </iati-activity>
  </xsl:template>
</xsl:stylesheet>
Python
import lxml.etree as ET

# LOAD XML AND XSL SCRIPT
xml = ET.parse('Input.xml')
xsl = ET.parse('Script.xsl')
transform = ET.XSLT(xsl)

# LOOP THROUGH ALL NODE COUNTS AND PASS PARAMETER TO XSLT
iati_count = len(xml.xpath('//iati-identifier'))
for i in range(iati_count):
    n = ET.XSLT.strparam(str(i+1))
    result = transform(xml, item_num=n)  # NAME OF XSL PARAMETER

    # SAVE XML TO FILE
    with open('Output_{}.xml'.format(i+1), 'wb') as f:
        f.write(result)
Output
Output_1.xml
<?xml version="1.0"?>
<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
    <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
    <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
    <narrative>Protracted relief and recovery operation</narrative>
    <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
  </title>
  <description>
    <narrative>Protracted relief and recovery operation</narrative>
  </description>
</iati-activity>
Output_2.xml
<?xml version="1.0"?>
<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077-01-0</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
    <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
    <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
    <narrative>Protracted relief and recovery operation</narrative>
    <narrative xml:lang="it">Protracted relief and recovery operation</narrative>
  </title>
  <description>
    <narrative>The scope of the program is to support the population on food security and resilience. In particular, to support local agricultural products and vulnerable families on food security.</narrative>
  </description>
</iati-activity>
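For comparison, the same split can be sketched without XSLT by grouping the direct children of <iati-activity> and deep-copying each one. This is an alternative under stated assumptions, not the approach above: split_on_tag is a hypothetical helper name, the miniature input is invented, and unlike the stylesheet (which copies only the first reporting-org, title and description) it keeps every sibling up to the next identifier.

```python
import copy
from lxml import etree as ET

def split_on_tag(parent, split_tag, wrapper_tag):
    """Group parent's direct children into new wrapper elements,
    starting a new group at each child named split_tag.
    Children that precede the first split_tag are dropped."""
    groups = []
    current = None
    for child in parent:
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        if child.tag == split_tag:
            current = ET.Element(wrapper_tag)
            groups.append(current)
        if current is not None:
            # deepcopy brings the child's whole subtree along,
            # so nested elements stay nested.
            current.append(copy.deepcopy(child))
    return groups

# Hypothetical miniature input, invented for illustration only.
root = ET.fromstring(
    "<iati-activity>"
    "<iati-identifier>A-1</iati-identifier>"
    "<title><narrative>First</narrative></title>"
    "<iati-identifier>A-2</iati-identifier>"
    "<title><narrative>Second</narrative></title>"
    "</iati-activity>"
)

groups = split_on_tag(root, "iati-identifier", "iati-activity")
for number, group in enumerate(groups, start=1):
    print(number, ET.tostring(group, pretty_print=True).decode())
```

To write files instead of printing, each group could be serialized with ET.tostring(group, pretty_print=True) and written to a file opened in 'wb' mode, as in the Python script above.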