Inserting XML fragments into an XML document with lxml
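I have a set of XML files to merge together. The master XML document is a complete ISO 19139 XML document, and two other XML files may contain <gmd:descriptiveKeywords> elements. I need to extract any of these <gmd:descriptiveKeywords> elements from the fragment files and add them to the master file. There are hundreds of these file sets, so I need to do some matching to make sure I combine the correct datasets.
A fragment XML file might look like this: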
<?xml version="1.0" encoding="UTF-8"?>
<ValueSupplyChain xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:gmx="http://www.isotc211.org/2005/gmx"
    xmlns:xlink="http://www.w3.org/1999/xlink" id="MICA_B1v-101"
    title="MINERALS4EU-EU MINERALS KNOWLEDGE DATA PLATFORM (EU-MKDP)">
  <gmd:descriptiveKeywords>
    <gmd:MD_Keywords id="exploration">
      <gmd:keyword>
        <gco:CharacterString>Exploration</gco:CharacterString>
      </gmd:keyword>
      <gmd:thesaurusName>
        <gmd:CI_Citation>
          <gmd:title>
            <gco:CharacterString>MICA ontology
              (ValueSupplyChainScheme)</gco:CharacterString>
          </gmd:title>
          <gmd:date gco:nilReason="unknown"/>
          <gmd:edition>
            <gco:CharacterString>2</gco:CharacterString>
          </gmd:edition>
          <gmd:identifier>
            <gmd:MD_Identifier>
              <gmd:code>
                <gmx:Anchor
                  xlink:href="https://w3id.org/mica/ontology/MicaOntology/7418a9ae1cd44847889c2c92408e1e71"
                />
              </gmd:code>
            </gmd:MD_Identifier>
          </gmd:identifier>
        </gmd:CI_Citation>
      </gmd:thesaurusName>
    </gmd:MD_Keywords>
  </gmd:descriptiveKeywords>
</ValueSupplyChain>
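The master XML is structured as follows (posted as an image, because the XML can get rather large):
Ideally, I would like to append the relevant fragment sections below the existing keyword sections and create a new master document.
My problem is that, although I seem to be able to match the correct datasets and find the relevant sections, the changes I think I am making never get written to the output file.
My code is: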
import logging
import platform
import glob
import os
from lxml import etree as et

logging.getLogger().setLevel(logging.DEBUG)

PC_name = platform.node()
if PC_name == 'blah ':
    root_directory = "blah\\blah\\outputs\\"
    dir_sep = "\\"
else:
    root_directory = "C:\\Temp\\"
    dir_sep = "\\"
batch_directory_name = "Batch1"
batch_number = "1"
in_directory = root_directory + batch_directory_name
out_directory_name = "splodge"
out_directory = in_directory + dir_sep + out_directory_name
if not os.path.exists(out_directory):
    os.makedirs(out_directory)
os.chdir(in_directory)

fileSuffix = ".xml"
globDirSep = "/"
fileTStem = "T" + batch_number + "_"
fileDStem = "D" + batch_number + "_"
fileVStem = "V" + batch_number + "_"
fileTPattern = fileTStem + "[0-9]*" + fileSuffix
globTPattern = in_directory + globDirSep + fileTPattern
stem = in_directory + dir_sep + fileTStem

ns_all = {'gmd': 'http://www.isotc211.org/2005/gmd',
          'gco': 'http://www.isotc211.org/2005/gco',
          'gmx': 'http://www.isotc211.org/2005/gmx',
          'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
          'gml': 'http://www.opengis.net/gml',
          'xlink': 'http://www.w3.org/1999/xlink',
          'geonet': 'http://www.fao.org/geonetwork'}
record_title = \
    'gmd:identificationInfo/gmd:MD_DataIdentification/gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString'
record_keywords = 'gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords'

for file in glob.glob(globTPattern):
    'Get the record number of the current T file'
    fnum = file.replace(stem, "").replace(fileSuffix, "")
    tree = et.parse(file)
    root = tree.getroot()
    recordT = root.find(record_title, ns_all)
    'We want to use the UPPER case version to compare with D and V file titles'
    RecordTitle = recordT.text.upper()
    logging.debug("T title: " + RecordTitle)
    dFile = in_directory + dir_sep + fileDStem + fnum + fileSuffix
    vFile = in_directory + dir_sep + fileVStem + fnum + fileSuffix
    'Find keyword sections in T file (and how many for interest...)'
    keywordList = root.findall(record_keywords, ns_all)
    knum = len(keywordList)
    logging.debug("T file has the following number of gmd:descriptiveKeywords sections: " + str(knum))
    try:
        dTree = et.parse(dFile)
        dRoot = dTree.getroot()
        recordDT = dRoot.attrib['title']
        logging.debug("D title: " + recordDT)
        if RecordTitle == recordDT:
            logging.debug("T and D titles are the same, we can continue...")
            'If the titles match then we can insert the D keywords fragment'
            DKeywords = dRoot.findall('gmd:descriptiveKeywords', ns_all)
            dnum = len(DKeywords)
            logging.debug("D file has the following number of gmd:descriptiveKeywords sections: " + str(dnum))
            keywordList.extend(DKeywords)
            logging.debug("Subtotal: " + str(len(keywordList)))
        else:
            logging.debug("T and D titles don't match")
    except:
        logging.debug("Cannot parse: " + dFile)
    try:
        vTree = et.parse(vFile)
        vRoot = vTree.getroot()
        recordVT = vRoot.attrib['title']
        logging.debug("V title: " + recordVT)
        if RecordTitle == recordVT:
            logging.debug("T and V titles are the same, we can continue...")
            'If the titles match then we can insert the V keywords fragment'
            VKeywords = vRoot.findall('gmd:descriptiveKeywords', ns_all)
            vnum = len(VKeywords)
            logging.debug("V file has the following number of gmd:descriptiveKeywords sections: " + str(vnum))
            keywordList.extend(VKeywords)
            logging.debug("Subtotal: " + str(len(keywordList)))
        else:
            logging.debug("T and V titles don't match")
    except:
        logging.debug("Cannot parse: " + vFile)
    newFile = "out" + batch_number + "_" + fnum + fileSuffix
    writeTo = out_directory_name + dir_sep + newFile
    tree.write(writeTo)
The debug output is as follows:
DEBUG:root:T title: BGR BOREHOLE MAP
DEBUG:root:T file has the following number of gmd:descriptiveKeywords sections: 7
DEBUG:root:D title: BGR BOREHOLE MAP
DEBUG:root:T and D titles are the same, we can continue...
DEBUG:root:D file has the following number of gmd:descriptiveKeywords sections: 5
DEBUG:root:Subtotal: 12
DEBUG:root:V title: BGR BOREHOLE MAP
DEBUG:root:T and V titles are the same, we can continue...
DEBUG:root:V file has the following number of gmd:descriptiveKeywords sections: 1
DEBUG:root:Subtotal: 13
DEBUG:root:T title: 3D, 4D AND PREDICTIVE MODELLING OF MAJOR MINERAL BELTS IN EUROPE
DEBUG:root:T file has the following number of gmd:descriptiveKeywords sections: 36
DEBUG:root:D title: 3D, 4D AND PREDICTIVE MODELLING OF MAJOR MINERAL BELTS IN EUROPE
DEBUG:root:T and D titles are the same, we can continue...
DEBUG:root:D file has the following number of gmd:descriptiveKeywords sections: 5
DEBUG:root:Subtotal: 41
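From the debug output it looks as though I have successfully added to the gmd:descriptiveKeywords elements, and the list length grows as expected, but, as I said, when I write out the XML I just get the contents of the original master file.
I have also tried using ElementTree, but I ran into the same problem; in addition, the output did not keep the namespace prefixes used in the master.
What am I doing wrong?
EDIT
Minimal code to reproduce the problem: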
from lxml import etree as et
# Open the master file, which is a well-formed and schema valid ISO 19139 XML record
tree = et.parse('T1_0.xml')
root = tree.getroot()
ns_all = {'gmd': 'http://www.isotc211.org/2005/gmd',
'gco': 'http://www.isotc211.org/2005/gco',
'gmx': 'http://www.isotc211.org/2005/gmx',
'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
'gml': 'http://www.opengis.net/gml',
'xlink': 'http://www.w3.org/1999/xlink',
'geonet': 'http://www.fao.org/geonetwork'}
keywordList = root.findall('gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords', ns_all)
# Just a quick check that everything works as expected
print(len(keywordList)) # Should return 7 for the master file
# Open a well-formed XML file containing content we wish to add to the (or a copy of the) master record
dTree = et.parse('D1_0.xml')
dRoot = dTree.getroot()
DKeywords = dRoot.findall('gmd:descriptiveKeywords', ns_all)
# Just a quick check that everything works as expected
print(len(DKeywords)) # Should return 5 for the D file
# Add the keywords from the second file to the keywords of the master file
keywordList.extend(DKeywords)
# We've added 5 records so the result should be 12
print(len(keywordList)) # I get 12 here
# Write out the new file
tree.write('combinedTD1_0.xml')
# If all worked as expected the new file should have 12
ctree = et.parse('combinedTD1_0.xml')
croot = ctree.getroot()
CKeywords = croot.findall('gmd:identificationInfo/gmd:MD_DataIdentification/gmd:descriptiveKeywords', ns_all)
print(len(CKeywords)) # I get 7 :(
The files are:
Example master file: T1_0.xml
Example fragment file: D1_0.xml
Example fragment file: V1_0.xml
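keywordList.extend(DKeywords) only adds the elements to the Python list returned by findall(). It does nothing to the XML tree.
To insert the additional descriptiveKeywords nodes as siblings of the existing nodes in the master document, you can do the following: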
# Get the last of the descriptiveKeywords nodes in the master document
last_kw = keywordList[-1]

# Get the node's parent and its position (index) within the parent
kw_parent = last_kw.getparent()
ix = kw_parent.index(last_kw)

# Insert the descriptiveKeyword nodes from the fragment file as successive siblings
for dk in DKeywords:
    kw_parent.insert(ix+1, dk)
    ix += 1
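The difference between growing the list and changing the tree can be checked on a small in-memory example. The following is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, the two miniature documents and their id values are made up for illustration, and the parent is looked up by path because ElementTree has no getparent(). The register_namespace() call is what keeps the gmd: prefix in ElementTree's output.

```python
import xml.etree.ElementTree as ET

GMD = "http://www.isotc211.org/2005/gmd"
NS = {"gmd": GMD}
# Keep the gmd: prefix on output instead of ElementTree's generated ns0: prefix.
ET.register_namespace("gmd", GMD)

# Hypothetical miniature master and fragment documents (ids are placeholders).
MASTER = (
    f'<gmd:MD_Metadata xmlns:gmd="{GMD}"><gmd:identificationInfo>'
    '<gmd:MD_DataIdentification>'
    '<gmd:descriptiveKeywords><gmd:MD_Keywords id="m1"/></gmd:descriptiveKeywords>'
    '<gmd:descriptiveKeywords><gmd:MD_Keywords id="m2"/></gmd:descriptiveKeywords>'
    '<gmd:language/>'
    '</gmd:MD_DataIdentification></gmd:identificationInfo></gmd:MD_Metadata>'
)
FRAGMENT = (
    f'<ValueSupplyChain xmlns:gmd="{GMD}" title="DEMO">'
    '<gmd:descriptiveKeywords><gmd:MD_Keywords id="d1"/></gmd:descriptiveKeywords>'
    '<gmd:descriptiveKeywords><gmd:MD_Keywords id="d2"/></gmd:descriptiveKeywords>'
    '</ValueSupplyChain>'
)
PARENT_PATH = "gmd:identificationInfo/gmd:MD_DataIdentification"
KW_PATH = PARENT_PATH + "/gmd:descriptiveKeywords"

root = ET.fromstring(MASTER)
frag_root = ET.fromstring(FRAGMENT)
keyword_list = root.findall(KW_PATH, NS)
d_keywords = frag_root.findall("gmd:descriptiveKeywords", NS)

# 1) Extending a list of found elements only grows that Python list.
grown_list = list(keyword_list)
grown_list.extend(d_keywords)
count_after_extend = len(root.findall(KW_PATH, NS))

# 2) Inserting into the parent element changes the tree itself.
parent = root.find(PARENT_PATH, NS)
ix = list(parent).index(keyword_list[-1])
for offset, dk in enumerate(d_keywords, start=1):
    parent.insert(ix + offset, dk)
count_after_insert = len(root.findall(KW_PATH, NS))

# Collect the ids in document order and serialize the modified tree.
ids = [kw.get("id") for kw in root.findall(KW_PATH + "/gmd:MD_Keywords", NS)]
serialized = ET.tostring(root, encoding="unicode")
print(len(grown_list), count_after_extend, count_after_insert, ids)
```

Here the list grows to four entries while the tree still holds two keyword sections, and only the insert() calls bring the tree itself to four, placed before the gmd:language sibling.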
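As I have answered many, many times, when you need to manipulate XML files, for example to merge documents, consider XSLT, a special-purpose language designed to transform XML files. Python's lxml module can run XSLT 1.0 scripts.
Specifically, XSLT provides the document() function, to which you can pass a file name parameter in order to append the fragment nodes to the existing master nodes. In addition, the XSLT below uses the Identity Transform to copy the entire document as is, together with Muenchian Grouping to index the document by distinct keywords. With this approach, the only for loop needed is the one that iterates over the files.
Since the OP did not provide a reproducible example, below is a demonstration using the top 3 users on Stack Overflow in the python and xslt tags. The master file starts with the rank 1 users. The Python script then iterates to append the rank 2 and then the rank 3 users by <tag1>:
Master XML (rank 1 users)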
<?xml version="1.0"?>
<Whosebug>
  <group lang="python">
    <topusers>
      <user>Martijn Pieters</user>
      <link>https://whosebug.com/users/100297/martijn-pieters</link>
      <location>Cambridge, United Kingdom </location>
      <year_rep>70,404</year_rep>
      <total_rep>590,309</total_rep>
      <tag1>python</tag1>
      <tag2>python-3.x</tag2>
      <tag3>python-2.7</tag3>
    </topusers>
  </group>
  <group lang="xslt">
    <topusers>
      <user>Dimitre Novatchev</user>
      <link>https://whosebug.com/users/36305/dimitre-novatchev</link>
      <location>United States</location>
      <year_rep>9,922</year_rep>
      <total_rep>197,245</total_rep>
      <tag1>xslt</tag1>
      <tag2>xml</tag2>
      <tag3>xpath</tag3>
    </topusers>
  </group>
</Whosebug>
Rank 2 XML (i.e., fragment)
<?xml version="1.0" encoding="utf-8"?>
<Whosebug>
  <group lang="python">
    <topusers>
      <user>Alex Martelli</user>
      <link>https://whosebug.com/users/95810/alex-martelli</link>
      <location>Sunnyvale, CA</location>
      <year_rep>49,172</year_rep>
      <total_rep>540,372</total_rep>
      <tag1>python</tag1>
      <tag2>list</tag2>
      <tag3>c++</tag3>
    </topusers>
  </group>
  <group lang="python">
    <topusers>
      <user>Martin Honnen</user>
      <link>https://whosebug.com/users/252228/martin-honnen</link>
      <location>Germany</location>
      <year_rep>10,046</year_rep>
      <total_rep>92,604</total_rep>
      <tag1>xslt</tag1>
      <tag2>xml</tag2>
      <tag3>xpath</tag3>
    </topusers>
  </group>
</Whosebug>
Rank 3 XML (i.e., fragment)
<?xml version="1.0" encoding="utf-8"?>
<Whosebug>
  <group lang="python">
    <topusers>
      <user>unutbu</user>
      <link>https://whosebug.com/users/190597/unutbu</link>
      <location></location>
      <year_rep>55,492</year_rep>
      <total_rep>453,267</total_rep>
      <tag1>python</tag1>
      <tag2>pandas</tag2>
      <tag3>numpy</tag3>
    </topusers>
  </group>
  <group lang="xslt">
    <topusers>
      <user>michael.hor257k</user>
      <link>https://whosebug.com/users/3016153/michael-hor257k</link>
      <location></location>
      <year_rep>11,339</year_rep>
      <total_rep>70,473</total_rep>
      <tag1>xslt</tag1>
      <tag2>xml</tag2>
      <tag3>xslt-1.0</tag3>
    </topusers>
  </group>
</Whosebug>
XSLT (save as an .xsl file in the same directory as the .xml files)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" omit-xml-declaration="no"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="keyid" match="topusers" use="tag1" />
  <xsl:param name="fragment" />

  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- IMPORT XML FRAGMENT -->
  <xsl:template match="group">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="topusers[generate-id() = generate-id(key('keyid', tag1))]"/>
    </xsl:copy>
  </xsl:template>

  <!-- COPY EXISTING topusers AND APPEND EXTERNAL topusers BY SAME KEYWORD -->
  <xsl:template match="topusers">
    <xsl:variable select="tag1" name="keyword"/>
    <xsl:for-each select="key('keyid', tag1)">
      <xsl:copy-of select="."/>
    </xsl:for-each>
    <!-- PASS PYTHON PARAM INTO document() -->
    <xsl:copy-of select="document($fragment)/Whosebug/group/topusers[tag1=$keyword]"/>
  </xsl:template>
</xsl:stylesheet>
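To make the key-based lookup in the stylesheet above more concrete, here is a rough plain-Python analogue of the same idea: index the fragment entries by their <tag1> value, then append the matching ones to each master group. This is a sketch only and not XSLT; it uses the standard library's xml.etree.ElementTree, works per group rather than document-wide, and the function name append_matching and the placeholder user names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature documents; the user names are placeholders.
MASTER = """<Whosebug>
  <group lang="python"><topusers><user>user_a</user><tag1>python</tag1></topusers></group>
  <group lang="xslt"><topusers><user>user_b</user><tag1>xslt</tag1></topusers></group>
</Whosebug>"""
FRAGMENT = """<Whosebug>
  <group lang="python"><topusers><user>user_c</user><tag1>python</tag1></topusers></group>
  <group lang="xslt"><topusers><user>user_d</user><tag1>xslt</tag1></topusers></group>
</Whosebug>"""


def append_matching(master_root, fragment_root):
    """Append fragment <topusers> to the master group that uses the same <tag1>."""
    # Index the fragment entries by their <tag1> text (the role xsl:key plays).
    fragment_index = defaultdict(list)
    for entry in fragment_root.iter("topusers"):
        fragment_index[entry.findtext("tag1")].append(entry)

    for group in master_root.findall("group"):
        # Distinct keys already present in this group, in document order.
        keys = []
        for entry in group.findall("topusers"):
            key = entry.findtext("tag1")
            if key not in keys:
                keys.append(key)
        # Add the fragment entries that share one of those keys.
        for key in keys:
            group.extend(fragment_index.get(key, []))
    return master_root


merged = append_matching(ET.fromstring(MASTER), ET.fromstring(FRAGMENT))
users_by_group = {
    group.get("lang"): [entry.findtext("user") for entry in group.findall("topusers")]
    for group in merged.findall("group")
}
print(users_by_group)
```

The XSLT version does the same matching declaratively, which is why the accompanying Python script only needs a loop over the files.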
Python (parses all XML and XSL files)
import os
import lxml.etree as et

# CURRENT DIRECTORY OF SCRIPT
cd = os.path.dirname(os.path.abspath(__file__))
master = os.path.join(cd, 'Master.xml')

# LOAD XSL SCRIPT
xsl = et.parse(os.path.join(cd, 'XSLTScript.xsl'))
transform = et.XSLT(xsl)

# ITERATE THROUGH FRAGMENT XML FILES IN DIRECTORY
for f in sorted(os.listdir(cd)):
    # SKIP THE MASTER ITSELF SO ITS OWN NODES ARE NOT APPENDED A SECOND TIME
    if f.endswith('.xml') and f != 'Master.xml':
        # LOAD MASTER XML
        doc = et.parse(master)
        print(f)
        # PASS FILE NAME AS PARAMETER FOR XSLT's document()
        n = et.XSLT.strparam(f)
        result = transform(doc, fragment=n)
        # UPDATE MASTER XML
        with open(master, 'wb') as s:
            s.write(result)
Output (top 3 users for each tag)
<?xml version="1.0"?>
<Whosebug>
  <group lang="python">
    <topusers>
      <user>Martijn Pieters</user>
      <link>https://whosebug.com/users/100297/martijn-pieters</link>
      <location>Cambridge, United Kingdom </location>
      <year_rep>70,404</year_rep>
      <total_rep>590,309</total_rep>
      <tag1>python</tag1>
      <tag2>python-3.x</tag2>
      <tag3>python-2.7</tag3>
    </topusers>
    <topusers>
      <user>Alex Martelli</user>
      <link>https://whosebug.com/users/95810/alex-martelli</link>
      <location>Sunnyvale, CA</location>
      <year_rep>49,172</year_rep>
      <total_rep>540,372</total_rep>
      <tag1>python</tag1>
      <tag2>list</tag2>
      <tag3>c++</tag3>
    </topusers>
    <topusers>
      <user>unutbu</user>
      <link>https://whosebug.com/users/190597/unutbu</link>
      <location/>
      <year_rep>55,492</year_rep>
      <total_rep>453,267</total_rep>
      <tag1>python</tag1>
      <tag2>pandas</tag2>
      <tag3>numpy</tag3>
    </topusers>
  </group>
  <group lang="xslt">
    <topusers>
      <user>Dimitre Novatchev</user>
      <link>https://whosebug.com/users/36305/dimitre-novatchev</link>
      <location>United States</location>
      <year_rep>9,922</year_rep>
      <total_rep>197,245</total_rep>
      <tag1>xslt</tag1>
      <tag2>xml</tag2>
      <tag3>xpath</tag3>
    </topusers>
    <topusers>
      <user>Martin Honnen</user>
      <link>https://whosebug.com/users/252228/martin-honnen</link>
      <location>Germany</location>
      <year_rep>10,046</year_rep>
      <total_rep>92,604</total_rep>
      <tag1>xslt</tag1>
      <tag2>xml</tag2>
      <tag3>xpath</tag3>
    </topusers>
    <topusers>
      <user>michael.hor257k</user>
      <link>https://whosebug.com/users/3016153/michael-hor257k</link>
      <location/>
      <year_rep>11,339</year_rep>
      <total_rep>70,473</total_rep>
      <tag1>xslt</tag1>
      <tag2>xml</tag2>
      <tag3>xslt-1.0</tag3>
    </topusers>
  </group>
</Whosebug>
OP XSLT
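A corresponding XSLT for the OP's actual master and fragment files might look like this untested version. It assumes that the keywords in the master use the same layout as in the posted fragment (this cannot be determined, because the image shows the <gmd:descriptiveKeywords> nodes collapsed):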
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:gco="http://www.isotc211.org/2005/gco"
    xmlns:gmx="http://www.isotc211.org/2005/gmx"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:gml="http://www.opengis.net/gml"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:geonet="http://www.fao.org/geonetwork">
  <xsl:output indent="yes" omit-xml-declaration="no"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="keyid" match="gmd:MD_Keywords" use="gmd:keyword/gco:CharacterString" />
  <xsl:param name="fragment" />

  <!-- IDENTITY TRANSFORM -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- IMPORT XML FRAGMENT -->
  <xsl:template match="gmd:descriptiveKeywords">
    <xsl:copy>
      <xsl:apply-templates select="gmd:MD_Keywords[generate-id() = generate-id(key('keyid', gmd:keyword/gco:CharacterString))]"/>
    </xsl:copy>
  </xsl:template>

  <!-- COPY EXISTING gmd:MD_Keywords AND APPEND EXTERNAL gmd:MD_Keywords BY SAME KEYWORD -->
  <xsl:template match="gmd:MD_Keywords">
    <xsl:variable select="gmd:keyword/gco:CharacterString" name="keyword"/>
    <xsl:for-each select="key('keyid', gmd:keyword/gco:CharacterString)">
      <xsl:copy-of select="."/>
    </xsl:for-each>
    <!-- PASS PYTHON PARAM INTO document() -->
    <xsl:copy-of select="document($fragment)/ValueSupplyChain/gmd:descriptiveKeywords/gmd:MD_Keywords[gmd:keyword/gco:CharacterString=$keyword]"/>
  </xsl:template>
</xsl:stylesheet>