Finding and replacing XML data inside tags using Python and ElementTree

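First off, I'm very new to Python and know very little. However, I've been tasked with making this program, so any help would be greatly appreciated.

I need to anonymise the data in an XML file. This will involve changing the contents of several tags to NULL.

I started by trying to replace the DateOfBirth data using Python and ElementTree. I need the contents of the DateOfBirth tag replaced with NULL.

Here is a snippet of the XML file, containing MOCK data for one learner. This snippet includes 1 learner; there will normally be 1-1000 learners, and the values always need to be changed to NULL for every one of them.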

<?xml version="1.0" encoding="UTF-8"?>
<!-- Please note that this file is properly formed, and serves as an example of a file that will load into the ILR DC system.  The data is anonymised and does not refer to a real-world provider, learning delivery or learner.  Based on the ILR specification, version 2, dated April 2018-->
<Message xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="ESFA/ILR/2018-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ESFA/ILR/2018-19">
    <Header>
        <CollectionDetails>
            <Collection>ILR</Collection>
            <Year>1819</Year>
            <FilePreparationDate>2018-01-07</FilePreparationDate>
        </CollectionDetails>
        <Source>
            <ProtectiveMarking>OFFICIAL-SENSITIVE-Personal</ProtectiveMarking>
            <UKPRN>99999999</UKPRN>
            <SoftwareSupplier>SupplierName</SoftwareSupplier>
            <SoftwarePackage>SystemName</SoftwarePackage>
            <Release>1</Release>
            <SerialNo>01</SerialNo>
            <DateTime>2018-06-26T11:14:05</DateTime>
            <!-- This and the next element only appear in files generated by FIS -->
            <ReferenceData>Version5.0, LARS 2017-08-01</ReferenceData>
            <ComponentSetVersion>1</ComponentSetVersion>
        </Source>
    </Header>
    <SourceFiles>
        <!-- The SourceFiles group only appears in files generated by FIS -->
        <SourceFile>
            <SourceFileName>ILR-LLLLLLLL1819-20180626-144401-01.xml</SourceFileName>
            <FilePreparationDate>2018-06-26</FilePreparationDate>
            <SoftwareSupplier>Software Systems Inc.</SoftwareSupplier>
            <SoftwarePackage>GreatStuffMIS</SoftwarePackage>
            <Release>1</Release>
            <SerialNo>01</SerialNo>
            <DateTime>2018-06-26T11:14:05</DateTime>
        </SourceFile>
    </SourceFiles>
    <LearningProvider>
        <UKPRN>99999999</UKPRN>
    </LearningProvider>
    <!-- 16 yr old learner undertaking full time 16-19 (excluding apprenticeships) funded programme -->
    <Learner>
        <LearnRefNumber>16Learner</LearnRefNumber>
        <PMUKPRN>87654321</PMUKPRN>
        <CampId>1234ABCD</CampId>
        <ULN>1061484016</ULN>
        <FamilyName>Smith</FamilyName>
        <GivenNames>Jane</GivenNames>
        <DateOfBirth>1999-02-27</DateOfBirth>
        <Ethnicity>31</Ethnicity>
        <Sex>F</Sex>
        <LLDDHealthProb>2</LLDDHealthProb>
        <Accom>5</Accom>
        <PlanLearnHours>440</PlanLearnHours>
        <PlanEEPHours>100</PlanEEPHours>
        <MathGrade>NONE</MathGrade>
        <EngGrade>D</EngGrade>
        <PostcodePrior>BR1 7SS</PostcodePrior>
        <Postcode>BR1 7SS</Postcode>
        <AddLine1>The Street</AddLine1>
        <AddLine2>ToyTown</AddLine2>
        <LearnerFAM>
            <LearnFAMType>LSR</LearnFAMType>
            <LearnFAMCode>55</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>EDF</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>MCF</LearnFAMType>
            <LearnFAMCode>3</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>FME</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>PPE</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <!-- Employment status record is not required for full time 16-19 (excluding apprenticeships) funded learners  -->
        <!-- 16-19  (excluding apprenticeships) funded study programme -->
        <LearningDelivery>
            <LearnAimRef>50022246</LearnAimRef>
            <AimType>5</AimType>
            <AimSeqNumber>1</AimSeqNumber>
            <LearnStartDate>2015-09-14</LearnStartDate>
            <LearnPlanEndDate>2016-07-02</LearnPlanEndDate>
            <FundModel>25</FundModel>
            <DelLocPostCode>BR1 3RL</DelLocPostCode>
            <CompStatus>1</CompStatus>
            <SWSupAimId>cb5f0d25-cff4-4ea0-92f5-99378cce306d</SWSupAimId>
            <LearningDeliveryFAM>
                <LearnDelFAMType>SOF</LearnDelFAMType>
                <LearnDelFAMCode>107</LearnDelFAMCode>
            </LearningDeliveryFAM>
        </LearningDelivery>
        <LearningDelivery>
            <LearnAimRef>50023408</LearnAimRef>
            <AimType>4</AimType>
            <AimSeqNumber>2</AimSeqNumber>
            <LearnStartDate>2015-02-14</LearnStartDate>
            <LearnPlanEndDate>2016-07-15</LearnPlanEndDate>
            <FundModel>25</FundModel>
            <DelLocPostCode>BR2 7UP</DelLocPostCode>
            <CompStatus>3</CompStatus>
            <LearnActEndDate>2015-04-01</LearnActEndDate>
            <WithdrawReason>98</WithdrawReason>
            <Outcome>3</Outcome>
            <SWSupAimId>c243182a-30af-4879-8f68-3eac708e6bb3</SWSupAimId>
            <LearningDeliveryFAM>
                <LearnDelFAMType>SOF</LearnDelFAMType>
                <LearnDelFAMCode>107</LearnDelFAMCode>
            </LearningDeliveryFAM>
        </LearningDelivery>
    </Learner>

My current code:

import os 
from xml.etree import ElementTree as et 

base_path  = os.path.dirname(os.path.realpath(__file__))

xml_file = os.path.join(base_path, "ILR_mock_data.xml") 

tree = et.parse(xml_file) 

# root = tree.getroot()

# for child in root:
#     print(child.tag, child.attrib)

#for child in root:
#    for element in child:
#        print(element.tag, ":", element.text)


tree.find('Learner/DateOfBirth').text = 'NULL'

tree.write("ILR_Aoned_output.xml")

Error message:

 Traceback (most recent call last):
  File "C:/Users/jkay/Desktop/Anon Tool RCU/RCU MOCK TOOL (Anonamising).py", line 20, in <module>
    tree.find('Learner/DateOfBirth').text = 'NULL'
AttributeError: 'NoneType' object has no attribute 'text'

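I would like the program to run through the XML file and return a new file in which every date of birth has been replaced with NULL.

Thank you for your help.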

Beautiful Soup looks like the solution you're looking for here. It's a library built for parsing HTML and XML files (though you may also have to install a parser; the 'xml' mode used below relies on lxml).

Applied to your use case:

from bs4 import BeautifulSoup

with open("my_file.xml", "r") as infile:
    xml_text = infile.read()

soup = BeautifulSoup(xml_text, 'xml')

# replace all DateOfBirth tag contents with NULL
for dob_tag in soup.find_all("DateOfBirth"):
    dob_tag.string = "NULL"

# output and save modified file
with open("my_file_edited.xml", "w") as outfile:
    outfile.write(soup.prettify())

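As a bonus, you can also adapt this to easily replace other tags, or to make more complex or conditional modifications. The library is well documented.

See below (using a simplified version of your XML):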

import xml.etree.ElementTree as ET

xml = '''<r>  <Learner>
        <LearnRefNumber>16Learner</LearnRefNumber>
        <PMUKPRN>87654321</PMUKPRN>
        <CampId>1234ABCD</CampId>
        <ULN>1061484016</ULN>
        <FamilyName>Smith</FamilyName>
        <GivenNames>Jane</GivenNames>
        <DateOfBirth>1999-02-27</DateOfBirth>
        <Ethnicity>31</Ethnicity>
        <Sex>F</Sex>
        <LLDDHealthProb>2</LLDDHealthProb>
        <Accom>5</Accom>
        <PlanLearnHours>440</PlanLearnHours>
        <PlanEEPHours>100</PlanEEPHours>
        <MathGrade>NONE</MathGrade>
        <EngGrade>D</EngGrade>
        <PostcodePrior>BR1 7SS</PostcodePrior>
        <Postcode>BR1 7SS</Postcode>
        <AddLine1>The Street</AddLine1>
        <AddLine2>ToyTown</AddLine2>
        <LearnerFAM>
            <LearnFAMType>LSR</LearnFAMType>
            <LearnFAMCode>55</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>EDF</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>MCF</LearnFAMType>
            <LearnFAMCode>3</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>FME</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>PPE</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <!-- Employment status record is not required for full time 16-19 (excluding apprenticeships) funded learners  -->
        <!-- 16-19  (excluding apprenticeships) funded study programme -->
        <LearningDelivery>
            <LearnAimRef>50022246</LearnAimRef>
            <AimType>5</AimType>
            <AimSeqNumber>1</AimSeqNumber>
            <LearnStartDate>2015-09-14</LearnStartDate>
            <LearnPlanEndDate>2016-07-02</LearnPlanEndDate>
            <FundModel>25</FundModel>
            <DelLocPostCode>BR1 3RL</DelLocPostCode>
            <CompStatus>1</CompStatus>
            <SWSupAimId>cb5f0d25-cff4-4ea0-92f5-99378cce306d</SWSupAimId>
            <LearningDeliveryFAM>
                <LearnDelFAMType>SOF</LearnDelFAMType>
                <LearnDelFAMCode>107</LearnDelFAMCode>
            </LearningDeliveryFAM>
        </LearningDelivery>
        <LearningDelivery>
            <LearnAimRef>50023408</LearnAimRef>
            <AimType>4</AimType>
            <AimSeqNumber>2</AimSeqNumber>
            <LearnStartDate>2015-02-14</LearnStartDate>
            <LearnPlanEndDate>2016-07-15</LearnPlanEndDate>
            <FundModel>25</FundModel>
            <DelLocPostCode>BR2 7UP</DelLocPostCode>
            <CompStatus>3</CompStatus>
            <LearnActEndDate>2015-04-01</LearnActEndDate>
            <WithdrawReason>98</WithdrawReason>
            <Outcome>3</Outcome>
            <SWSupAimId>c243182a-30af-4879-8f68-3eac708e6bb3</SWSupAimId>
            <LearningDeliveryFAM>
                <LearnDelFAMType>SOF</LearnDelFAMType>
                <LearnDelFAMCode>107</LearnDelFAMCode>
            </LearningDeliveryFAM>
        </LearningDelivery>
    </Learner></r>
'''

root = ET.fromstring(xml)
dob_lst = root.findall('.//Learner/DateOfBirth')
for dob in dob_lst:
  dob.text = 'NULL'
ET.dump(root)
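
Since the question mentions that several tags need to be set to NULL, here is a minimal sketch of how the same idea might be extended. The helper names `local_name` and `anonymise`, and the choice of `FamilyName`, `GivenNames` and `DateOfBirth` as the fields to blank out, are my own assumptions for illustration, not something from the question. It matches on the local tag name, so it should work whether or not the document declares a default namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of fields to blank out; adjust to your requirements.
TAGS_TO_NULL = {"FamilyName", "GivenNames", "DateOfBirth"}


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def anonymise(root, tags=TAGS_TO_NULL, placeholder="NULL"):
    """Overwrite the text of every element whose local name is in `tags`.

    Returns the number of elements that were changed.
    """
    changed = 0
    for element in root.iter():
        # Skip nodes whose tag is not a string (e.g. comments, if kept by the parser).
        if isinstance(element.tag, str) and local_name(element.tag) in tags:
            element.text = placeholder
            changed += 1
    return changed


# Cut-down sample based on the XML in the question.
sample = """<Message xmlns="ESFA/ILR/2018-19">
  <Learner>
    <FamilyName>Smith</FamilyName>
    <GivenNames>Jane</GivenNames>
    <DateOfBirth>1999-02-27</DateOfBirth>
    <Sex>F</Sex>
  </Learner>
</Message>"""

root = ET.fromstring(sample)
count = anonymise(root)
```

Walking the whole tree with `iter()` is slower than a targeted `findall()`, but for files with up to around 1000 learners that is unlikely to matter, and it avoids having to spell out the namespace in every path.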

You need to find all the DateOfBirth elements and replace the text of each one:

for element in tree.findall('.//Learner/DateOfBirth'):
    element.text = 'NULL'
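
One caveat worth adding: the file in the question declares a default namespace (`xmlns="ESFA/ILR/2018-19"`) on `<Message>`, and ElementTree stores every tag with that namespace attached. An unqualified path such as `Learner/DateOfBirth` will therefore most likely match nothing, which would explain the `'NoneType' object has no attribute 'text'` error. Below is a minimal sketch of a namespace-aware lookup; the prefix `ilr` and the variable names are arbitrary choices of mine, and the sample is a cut-down version of the question's XML.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of <Message> in the question.
ILR_NS = "ESFA/ILR/2018-19"

# Ask ElementTree to serialise this namespace as the default one,
# rather than inventing prefixes such as ns0: in the output.
ET.register_namespace("", ILR_NS)

sample = """<Message xmlns="ESFA/ILR/2018-19">
    <Learner>
        <LearnRefNumber>16Learner</LearnRefNumber>
        <DateOfBirth>1999-02-27</DateOfBirth>
    </Learner>
</Message>"""

root = ET.fromstring(sample)

# The prefix "ilr" is arbitrary; it only has to match the keys of this mapping.
ns = {"ilr": ILR_NS}

# Without the namespace nothing is found, so find() would return None.
unqualified = root.findall(".//Learner/DateOfBirth")

# With the namespace every learner's DateOfBirth is matched.
qualified = root.findall(".//ilr:Learner/ilr:DateOfBirth", ns)
for dob in qualified:
    dob.text = "NULL"

output = ET.tostring(root, encoding="unicode")
```

With a tree loaded via `et.parse(xml_file)` as in the question, the same `findall` call can be made on `tree`, and the result saved with something like `tree.write("ILR_Aoned_output.xml", encoding="utf-8", xml_declaration=True)`.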