Iterating through specific XML elements and namespace issues

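I want to read an XML file containing metadata, extract specific parts, and write them to another file. However, I'm stuck right at the start, at parsing the 2 MB metadata XML file.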

For testing and debugging, I've reduced the input file to the smaller sample XML below.

<?xml version="1.0" encoding="UTF-8"?>
<ODM Description="Study Metadata" xmlns="http://www.cdisc.org/ns/odm/v1.3" xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1" >
    <Study OID="MyStudy">
        <GlobalVariables>
            <StudyName>MyStudy</StudyName>
            <ProtocolName>MyProtocol</ProtocolName>
        </GlobalVariables>
        <BasicDefinitions>
            <MeasurementUnit OID="MU_CM" Name="cm">
                <Symbol>
                    <TranslatedText>cm</TranslatedText>
                </Symbol>
            </MeasurementUnit>
            <MeasurementUnit OID="MU_KG" Name="kg">
                <Symbol>
                    <TranslatedText>kg</TranslatedText>
                </Symbol>
            </MeasurementUnit>
        </BasicDefinitions>
        <MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
            <Protocol>
                <StudyEventRef StudyEventOID="SE_BASELINE" OrderNumber="1" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_3WK" OrderNumber="2" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_6WK" OrderNumber="3" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_9WK" OrderNumber="4" Mandatory="Yes"/>
                <StudyEventRef StudyEventOID="SE_12WK" OrderNumber="5" Mandatory="Yes"/>
            </Protocol>
            <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date" SASFieldName="BL_D_VDA" Comment="Visit date" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
                <Question>
                    <TranslatedText>Visit date</TranslatedText>
                </Question>
            </ItemDef>
            <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer" Length="1" SASFieldName="BL_D_MCO" Comment="Medicine code" OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1">
                <Question>
                    <TranslatedText>Medicine code</TranslatedText>
                </Question>
                <CodeListRef CodeListOID="CL_12345"/>
            </ItemDef>
        </MetaDataVersion>
    </Study>
</ODM>

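I'm only interested in the ItemDef elements and their attributes, and I'm using xml.etree.ElementTree to parse the XML file. This is what I have so far, but it never reaches the "-- found ItemDef" part; see the code below.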

# which file to read
FILE_NAME = "mystudy.xml"
ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}

# Import the os module
import os
import xml.etree.ElementTree as ET
import csv
import array as arr

e = ET.parse(os.path.join(os.getcwd(), FILE_NAME))
root = e.getroot()

# testing to see if it parses anything
print(root.get('Description'))

namespace = "{http://www.cdisc.org/ns/odm/v1.3}"

# none of this seems to work..
# col = e.findall('ItemDef')
# col = e.findall('.//ItemDef')
# col = e.findall('(*)ItemDef')
# col = e.findall('{0}ODM/Study/MetaDataVersion/ItemDef'.format(namespace))
col = e.findall('{0}ODM/{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))

print("start for-loop")
# iterate all
for itemdef in col:
    name = itemdef.get('Name')
    print("-- found ItemDef name=", name)

print("finished for-loop")

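As I understand it, you have to specify the namespace correctly or nothing will be matched, and that is probably where the error is. I searched stackoverflow.com for similar questions and tried several things (see the comments in the code), but it still doesn't work.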

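Because e.findall already searches relative to the root element (<ODM>), the path must not name ODM again. Remove it from the XPath expression: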

col = e.findall('./{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))

# Study Metadata
# start for-loop
# -- found ItemDef name= BL_D_VISITDATE
# -- found ItemDef name= BL_D_MEDCODE
# finished for-loop
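
To see why the original path matched nothing, it can help to print the tag of the root element as ElementTree stores it. The following is only a minimal sketch: it assumes a trimmed-down copy of the sample XML held in a string (the name SAMPLE is made up for illustration) rather than the real mystudy.xml file.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample document, inlined for illustration only.
SAMPLE = """<ODM Description="Study Metadata" xmlns="http://www.cdisc.org/ns/odm/v1.3">
    <Study OID="MyStudy">
        <MetaDataVersion OID="v1.0.0" Name="MetaDataVersion_v1.0.0">
            <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date"/>
            <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer"/>
        </MetaDataVersion>
    </Study>
</ODM>
"""

root = ET.fromstring(SAMPLE)
namespace = "{http://www.cdisc.org/ns/odm/v1.3}"

# The root already *is* the ODM element; its tag carries the namespace URI in braces.
print(root.tag)

# This path looks for an ODM child *below* the root, so it matches nothing.
wrong = root.findall('{0}ODM/{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))

# Starting the path at Study, the first child level, matches both ItemDef elements.
right = root.findall('{0}Study/{0}MetaDataVersion/{0}ItemDef'.format(namespace))

print(len(wrong), len(right))
```

The same reasoning applies to e.findall on the parsed tree, since that call also starts at the root of the tree.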

Better still, use the namespaces argument of findall, passing the dictionary you already defined, which maps the d prefix to the namespace URI:

ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}

col = e.findall('./d:Study/d:MetaDataVersion/d:ItemDef', namespaces=ns)

# SHORT-HAND FOR ANYWHERE SEARCH
col = e.findall('.//d:ItemDef', namespaces=ns)
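
Since the stated goal is to extract the ItemDef attributes and write them to another file, here is one possible sketch of that last step. It rests on several assumptions: the XML is inlined as a trimmed-down string (SAMPLE), the output is CSV, the column selection is arbitrary, and an in-memory io.StringIO buffer stands in for the real output file. Note that unprefixed attributes such as Name are not in any namespace, while the prefixed OpenClinica:FormOIDs attribute has to be looked up by its full {URI}name form.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample document, inlined for illustration only.
SAMPLE = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3"
     xmlns:OpenClinica="http://www.openclinica.org/ns/odm_ext_v130/v3.1">
  <Study OID="MyStudy">
    <MetaDataVersion OID="v1.0.0">
      <ItemDef OID="I_MYSTUDY_B_BL_D_VDATE" Name="BL_D_VISITDATE" DataType="date"
               OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1"/>
      <ItemDef OID="I_MYSTUDY_B_BL_D_VCODE" Name="BL_D_MEDCODE" DataType="integer"
               OpenClinica:FormOIDs="F_MYSTUDY_BL_D_2,F_MYSTUDY_BL_D_1"/>
    </MetaDataVersion>
  </Study>
</ODM>
"""

ns = {'d': 'http://www.cdisc.org/ns/odm/v1.3'}
# Namespace of the OpenClinica extension attributes, in {URI} form (hypothetical helper name).
OC = '{http://www.openclinica.org/ns/odm_ext_v130/v3.1}'

root = ET.fromstring(SAMPLE)

# Collect one row per ItemDef; plain attributes need no namespace,
# the prefixed FormOIDs attribute needs the {URI} prefix.
rows = []
for itemdef in root.findall('.//d:ItemDef', namespaces=ns):
    rows.append([
        itemdef.get('OID'),
        itemdef.get('Name'),
        itemdef.get('DataType'),
        itemdef.get(OC + 'FormOIDs'),
    ])

# Write to an in-memory buffer; for a real file, open(path, 'w', newline='') could be used instead.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['OID', 'Name', 'DataType', 'FormOIDs'])
writer.writerows(rows)

print(out.getvalue())
```

On Python 3.8 or later, the wildcard form root.findall('.//{*}ItemDef') should also match ItemDef in any namespace without a prefix dictionary, which may be convenient for quick exploration.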