Dataframe to hierarchical XML

Question

Read a CSV into a dataframe and then convert it to XML using the lxml library

This is my first time working with XML, and it looks like I have it partially working. Any help would be greatly appreciated.

CSV file used to create the dataframe:

Parent,Element,Text,Attribute
,TXLife,"
    ",{'Version': '2.25.00'}
TXLife,UserAuthRequest,"
        ",{}
UserAuthRequest,UserLoginName,*****,{}
UserAuthRequest,UserPswd,"
            ",{}
UserPswd,CryptType,None,{}
UserPswd,Pswd,****,{}
TXLife,TXLifeRequest,"
        ",{'PrimaryObjectID': 'Policy_1'}
TXLifeRequest,TransRefGUID,706D67C1-CC4D-11CF-91FB444554540000,{}
TXLifeRequest,TransType,Holding Change,{'tc': '502'}
TXLifeRequest,TransExeDate,2006-11-19,{}
TXLifeRequest,TransExeTime,13:15:33-07:00,{}
TXLifeRequest,ChangeSubType,"
            ",{}
ChangeSubType,ChangeTC,Change Participant,{'tc': '9'}
TXLifeRequest,OLifE,"
            ",{}
OLifE,Holding,"
                ",{'id': 'Policy_1'}
Holding,HoldingTypeCode,Policy,{'tc': '2'}
Holding,Policy,"
                    ",{}
Policy,PolNumber,1234567,{}
Policy,LineOfBusiness,Annuity,{'tc': '2'}
Policy,Annuity,,{}
OLifE,Party,"
                ",{'id': 'Beneficiary_1'}
Party,PartyTypeCode,Organization,{'tc': '2'}
Party,FullName,The Smith Trust,{}
Party,Organization,"
                    ",{}
Organization,OrgForm,Trust,{'tc': '16'}
Organization,DBA,The Smith Trust,{}
OLifE,Relation,"
                ","{'id': 'Relation_1', 'OriginatingObjectID': 'Policy_1', 'RelatedObjectID': 'Beneficiary_1'}"
Relation,OriginatingObjectType,Holding,{'tc': '4'}
Relation,RelatedObjectType,Party,{'tc': '6'}
Relation,RelationRoleCode,Primary Beneficiary,{'tc': '34'}
Relation,BeneficiaryDesignation,Named,{'tc': '1'}

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv') .fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))

# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()

for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:
    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # list(root) = root.getchildren()
    children = [item for item in str(list(root)).split(' ')]
    rootstring = str(root).split(' ')[1]

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == str(root).split(' ')[1]:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif not prnt == rootstring and prnt in children:
        child = etree.SubElement(parent, elem, attrib).text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib).text = txt

print(etree.tostring(tree, pretty_print=True).decode())

Actual result:

<TXLife>
  <UserAuthRequest>
    <UserLoginName>*****</UserLoginName>
    <UserPswd>
            </UserPswd>
    <CryptType>None</CryptType>
    <Pswd>xxxxxx</Pswd>
  </UserAuthRequest>
  <TXLifeRequest>
    <TransRefGUID>706D67C1-CC4D-11CF-91FB444554540000</TransRefGUID>
    <TransType tc="502">Holding Change</TransType>
    <TransExeDate>11/19/2006</TransExeDate>
    <TransExeTime>13:15:33-07:00</TransExeTime>
    <ChangeSubType>
            </ChangeSubType>
    <ChangeTC tc="9">Change Participant</ChangeTC>
    <OLifE>
            </OLifE>
    <Holding id="Policy_1">
                </Holding>
    <HoldingTypeCode tc="2">Policy</HoldingTypeCode>
    <Policy>
                    </Policy>
    <PolNumber>1234567</PolNumber>
    <LineOfBusiness tc="2">Annuity</LineOfBusiness>
    <Annuity>NA</Annuity>
    <Party id="Beneficiary_1">
                </Party>
    <PartyTypeCode tc="2">Organization</PartyTypeCode>
    <FullName>The Smith Trust</FullName>
    <Organization>
                    </Organization>
    <OrgForm tc="16">Trust</OrgForm>
    <DBA>The Smith Trust</DBA>
    <Relation OriginatingObjectID="Policy_1" RelatedObjectID="Beneficiary_1" id="Relation_1">
                </Relation>
    <OriginatingObjectType tc="4">Holding</OriginatingObjectType>
    <RelatedObjectType tc="6">Party</RelatedObjectType>
    <RelationRoleCode tc="34">Primary Beneficiary</RelationRoleCode>
    <BeneficiaryDesignation tc="1">Named</BeneficiaryDesignation>
  </TXLifeRequest>
</TXLife>

Expected result:

<TXLife Version="2.25.00">
    <UserAuthRequest>
        <UserLoginName>*****</UserLoginName>
        <UserPswd>
            <CryptType>None</CryptType>
            <Pswd>****</Pswd>
        </UserPswd>
    </UserAuthRequest>
    <TXLifeRequest PrimaryObjectID="Policy_1">
        <TransRefGUID>706D67C1-CC4D-11CF-91FB444554540000</TransRefGUID>
        <TransType tc="502">Holding Change</TransType>
        <TransExeDate>2006-11-19</TransExeDate>
        <TransExeTime>13:15:33-07:00</TransExeTime>
        <ChangeSubType>
            <ChangeTC tc="9">Change Participant</ChangeTC>
        </ChangeSubType>
        <OLifE>
            <Holding id="Policy_1">
                <HoldingTypeCode tc="2">Policy</HoldingTypeCode>
                <Policy>
                    <PolNumber>1234567</PolNumber>
                    <LineOfBusiness tc="2">Annuity</LineOfBusiness>
                    <Annuity></Annuity>
                </Policy>
            </Holding>
            <Party id="Beneficiary_1">
                <PartyTypeCode tc="2">Organization</PartyTypeCode>
                <FullName>The Smith Trust</FullName>
                <Organization>
                    <OrgForm tc="16">Trust</OrgForm>
                    <DBA>The Smith Trust</DBA>
                </Organization>
            </Party>
            <Relation id="Relation_1" OriginatingObjectID="Policy_1" RelatedObjectID="Beneficiary_1">
                <OriginatingObjectType tc="4">Holding</OriginatingObjectType>
                <RelatedObjectType tc="6">Party</RelatedObjectType>
                <RelationRoleCode tc="34">Primary Beneficiary</RelationRoleCode>
                <BeneficiaryDesignation tc="1">Named</BeneficiaryDesignation>
            </Relation>
        </OLifE>
    </TXLifeRequest>
</TXLife>

How can I get the hierarchical result shown above?

Answer 1

You've made a good start! I think the easiest thing is to go through your code bit by bit, explain what needs adjusting, and suggest some improvements:

Reading and cleaning the data

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))

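.apply works fine, but there is also a .str.replace() method you can use, which is a bit more concise and clear (.str lets you treat the values of a column as strings and operate on them accordingly).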
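
As a minimal sketch of the difference, assuming a small hand-built Series in place of the real Attribute column (the sample values are copied from the CSV in the question), both spellings should produce the same strings, which json.loads can then parse:

```python
import json

import pandas as pd

# Stand-in for dfc['Attribute']; values copied from the CSV in the question.
attrs = pd.Series(["{'Version': '2.25.00'}", "{}", "{'tc': '502'}"])

# Element-wise version, as written in the question.
with_apply = attrs.apply(lambda x: x.replace("'", '"'))

# String accessor version; regex=False asks for a literal replacement.
with_str = attrs.str.replace("'", '"', regex=False)

# With JSON-style quotes in place, each string parses into a dict.
parsed = with_str.apply(json.loads)
```

Note that the quote swap only holds up while no attribute value itself contains an apostrophe or a double quote.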

Adding the root

# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()

This is all good!

Iterating over the rows

for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:

Since you are retrieving all of the columns, you don't need to index into dfc to select them, so you can take that part out:

for prnt, elem, txt, attr in dfc[1:].values:

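This works fine, but there are built-in methods for iterating over the rows of a DataFrame, and we can use itertuples(). It returns a namedtuple for each row, which includes the index (basically the row number) as the first item in the tuple, so we need to adjust for that: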

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
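
To make the extra index item concrete, here is a small sketch on a three-row stand-in for dfc (the frame below is an assumption built for illustration, not the full CSV). It also shows itertuples(index=False), which leaves the index out if you would rather unpack only the four columns:

```python
import pandas as pd

# Tiny stand-in for dfc with the same four columns as the CSV.
df = pd.DataFrame(
    {
        "Parent": ["NA", "TXLife", "UserAuthRequest"],
        "Element": ["TXLife", "UserAuthRequest", "UserLoginName"],
        "Text": ["NA", "NA", "*****"],
        "Attribute": ["{}", "{}", "{}"],
    }
)

# Default: the index comes first, so five names are needed when unpacking.
with_index = [tuple(row) for row in df[1:].itertuples()]

# index=False drops it, leaving only the four column values.
without_index = [tuple(row) for row in df[1:].itertuples(index=False)]
```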

Setting variables

    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # list(root) = root.getchildren()
    children = [item for item in str(list(root)).split(' ')]
    rootstring = str(root).split(' ')[1]

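Swapping the single quotes for double quotes early on is a good trick, because it lets us use json to turn the attributes into a dictionary. Every Element has a .tag attribute that we can use to get its name, which is what we want here: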

children = [item.tag for item in root]
rootstring = root.tag

list(root) or root.getchildren() would both give us a list of root's child elements, but we can also iterate over root directly with for ... in, as done here.
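
A short sketch of those two lines on a hand-built tree (the tree here is an assumption for illustration). The import falls back to the standard library's xml.etree.ElementTree, which offers the same Element, SubElement and .tag calls, in case lxml is not installed:

```python
try:
    import lxml.etree as etree  # library used in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # standard library, same calls

root = etree.Element("TXLife")
etree.SubElement(root, "UserAuthRequest")
etree.SubElement(root, "TXLifeRequest")

# .tag gives the element name directly; no parsing of str(root) is needed.
rootstring = root.tag

# Iterating over an element yields its direct children in document order.
children = [item.tag for item in root]
```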

Adding elements to the tree

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == str(root).split(' ')[1]:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif not prnt == rootstring and prnt in children:
        child = etree.SubElement(parent, elem, attrib).text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib).text = txt

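str(root).split(' ')[1] is exactly what we set rootstring to above, so we can use that instead.
Because the first if statement already checks whether prnt == rootstring, we know it can't be equal by the time we reach the first elif, so we don't need to check again.
When we create child, we are making two assignments at once... it does somehow manage to create the child and its text (!), but it means child ends up set to the text rather than to the new SubElement. It's better to do it in two steps.
When we look for the parent, we are currently building a list of lists (split() returns a list), so the check doesn't work. We need the items' tags.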
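
The double assignment is easy to misread, so here is a small sketch of what it does, using a hand-built root (an assumption for illustration) and the same import fallback to the standard library in case lxml is unavailable:

```python
try:
    import lxml.etree as etree  # library used in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # standard library, same calls

root = etree.Element("TXLife")

# Chained assignment: the string is bound to `child` first, then the
# SubElement call runs and the same string is stored in its .text.
child = etree.SubElement(root, "UserLoginName").text = "*****"

chained_type = type(child).__name__  # `child` is the string, not the element
created = root[0]                    # the element was still created

# Two-step version: `child` now refers to the new element itself.
child = etree.SubElement(root, "Pswd")
child.text = "****"
```

This is why the original loop still produced elements with text even though child never held an element.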

Making all of those changes gives us:

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == rootstring:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif prnt in children:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [item.tag for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt

But there are a couple of problems here.

The first part (the if statement) is fine.

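In the second part (the first elif statement), we check whether the new element's parent is one of root's children. If it is, we add the new element as a child of parent. parent is certainly one of root's children, but we haven't actually checked that it is the right one. It's just the last thing we added to root. Luckily, because all the elements in our CSV are in order, it happens to be correct, but it would be better to be more explicit.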

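In the third part (the second elif), it's a good idea to check whether prnt already exists further down the tree. But at the moment, if prnt doesn't exist, we just add the element to parent, which isn't its actual parent! And if prnt does exist, we don't add the element at all (so we would need an else clause).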

Solution

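Thankfully, there is an easy way: we can use .find() to locate the prnt element wherever it is in the tree, and then add the new element there. That also makes the whole thing much shorter!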

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    child = etree.SubElement(parent, elem, attrib)
    child.text = txt

The .// in root.find(".//" + prnt) means it will search for a matching element tag anywhere below root in the tree (read more here: https://lxml.de/tutorial.html#elementpath).
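
A brief sketch of how the search behaves, on a hand-built three-level tree (an assumption for illustration), with a fallback to the standard library's ElementTree, which supports the same path syntax, if lxml is unavailable. It also shows why the loop needs the prnt == root.tag special case:

```python
try:
    import lxml.etree as etree  # library used in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # standard library, same calls

root = etree.Element("TXLife")
request = etree.SubElement(root, "TXLifeRequest")
olife = etree.SubElement(request, "OLifE")
etree.SubElement(olife, "Holding")

# "OLifE" is a grandchild of root; a plain find() only checks direct children.
direct = root.find("OLifE")

# ".//" widens the search to every descendant of root.
anywhere = root.find(".//OLifE")

# A tag that does not exist gives None, which SubElement would then reject.
missing = root.find(".//Party")

# The root is not among its own descendants, hence the special case.
self_lookup = root.find(".//TXLife")
```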

Final script

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna("NA")
dfc['Attribute'] = dfc['Attribute'].str.replace("'", '"').apply(lambda s: json.loads(s))

# Add the root element for xml
root = etree.Element(dfc['Element'][0], dfc['Attribute'][0])

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Fix text
    text = txt.strip()
    if not text:
        text = None
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    # Create element
    child = etree.SubElement(parent, elem, attr)
    child.text = text

xml_string = etree.tostring(root, pretty_print=True).decode().replace(">NA<", "><")
print(xml_string)

I made a few more changes:

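I moved the json.loads step for the attribute dictionaries up to where we change the quotes, chaining it on the end with apply. We need it there so that the dictionary is ready to use when we create the root element.
There were some problems getting pretty printing to work properly, which is what the "Fix text" section is for (see this Stack Overflow question for the issue I ran into).
It would be nicer to have .fillna("") (filling with empty strings), but if we do that we end up with <Annuity/> instead of <Annuity></Annuity> (which is legal XML: if an element has no text or child elements, a single self-closing tag is enough). To make it come out the way we want, though, the element needs some 'content' so that a separate start tag is written. So I left it as .fillna("NA") and then, at the very end, replaced it manually in the output string.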
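
A possible alternative to the 'NA' placeholder, offered as a sketch rather than as part of the script above: lxml appears to treat an empty string in .text differently from None, writing separate start and end tags for it. This relies on lxml's serializer (the standard library's ElementTree behaves differently), so it is worth checking on your own lxml version:

```python
import lxml.etree as etree

policy = etree.Element("Policy")
annuity = etree.SubElement(policy, "Annuity")

# No text at all (None): lxml writes the short, self-closing form.
short_form = etree.tostring(policy).decode()

# An empty string counts as (empty) text content, so both tags are written.
annuity.text = ""
long_form = etree.tostring(policy).decode()
```

If that holds for your version, rows with genuinely empty text could be given "" instead of None, and the final .replace(">NA<", "><") would no longer be needed.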

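It is also worth mentioning that this script makes (at least) four assumptions about the input data:

Parent elements are created before any of their children (i.e. they appear earlier in the CSV file)
Element names are unique (or at least, any duplicated names don't have any children, so we never use .find() somewhere it could match more than one element; .find() always returns the first match)
There aren't any text values of 'NA' that you want to keep in the final XML (they will be removed along with the spurious 'NA' text from the Annuity element)
Text that consists only of whitespace doesn't need to be preserved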
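
The first two assumptions can be checked before building the tree. The sketch below uses a hypothetical helper, check_rows, which is not part of the script above; it takes (parent, element) pairs in CSV order and reports rows whose parent has not been created yet or whose parent name is ambiguous:

```python
from collections import Counter


def check_rows(rows, root_tag):
    """Return a list of problems found in (parent, element) pairs.

    Hypothetical helper for illustration; not part of the answer's script.
    """
    problems = []
    seen = Counter([root_tag])  # how many times each element name was created
    for parent, element in rows:
        if seen[parent] == 0:
            problems.append(f"{element}: parent {parent} not created yet")
        elif seen[parent] > 1:
            problems.append(f"{element}: parent {parent} is ambiguous")
        seen[element] += 1
    return problems


good = [("TXLife", "TXLifeRequest"), ("TXLifeRequest", "OLifE"), ("OLifE", "Party")]
out_of_order = [("OLifE", "Party"), ("TXLife", "OLifE")]
duplicated = [("TXLife", "Party"), ("TXLife", "Party"), ("Party", "FullName")]

good_result = check_rows(good, "TXLife")
order_result = check_rows(out_of_order, "TXLife")
duplicate_result = check_rows(duplicated, "TXLife")
```

With the real data, the pairs could come from the Parent and Element columns of dfc[1:], with dfc['Element'][0] as root_tag.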
