How do I export part of an XML file to a multi-level DataFrame in pandas?

I am trying to export part of an XML file to a multi-level DataFrame, because I find that more convenient to work with. An example of the file:

<file filename="stack_example" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone z_id="10">
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l-identity="2"> 1.256</location>
                    <location l-identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l-identity="2"> 1.6</location>
                    <location l-identity="4"> 2.5</location>
                    <location l-identity="17"> 2.4</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l-identity="78"> 3.2</location>
                </thing>
                <thing identity="5">
                    <location l-identity="2"> 1.7</location>
                    <location l-identity="7"> 4.5</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>

The ideal output for this example would be a pandas DataFrame that looks like this:

time-begin  time-end     thing  location    surface
         1         2         1         2      1,256
                                      45        2,3
                             3         2        1,6
                                       4        2,5
                                      17        2,4
         2         3         1        78        3,2
                             5         2        1,7
                                       7        4,5

Here is the code I have written so far:

import pandas as pd
from bs4 import BeautifulSoup
import lxml

datas = open("stack_example.xml","r")
doc = BeautifulSoup(datas.read(), "lxml")
doc.unit.get("surface")

l = []
temp={}

surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")

    temp["time_begin"]=[time_begin]
    temp["time_end"]=[time_end]

    things = surfacedata.find_all("thing", recursive=False)
    for thing in thingss:
        identity = thing.get("identity")
        temp["thing"]=[identity]
 
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            l_identity = location.get("l-identity")
            surface = location.getText()
            temp["surface"]=[surface]
            temp["location"]=[l_identity]
        l.append(pd.DataFrame(temp))
        
res = pd.concat(l, ignore_index=True).fillna(0.)

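It only picks up the last location of each thing, because the location is overwritten inside the loop, but I am not sure how to get from here to the expected result.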

You were almost there. I only changed the logic slightly; the rest looks good.

datas = '''<file filename="stack_example" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone z_id="10">
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l-identity="2"> 1.256</location>
                    <location l-identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l-identity="2"> 1.6</location>
                    <location l-identity="4"> 2.5</location>
                    <location l-identity="17"> 2.4</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l-identity="78"> 3.2</location>
                </thing>
                <thing identity="5">
                    <location l-identity="2"> 1.7</location>
                    <location l-identity="7"> 4.5</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>'''

                                      

Code:

import pandas as pd
from bs4 import BeautifulSoup
import lxml

# To read from a file instead of the string above:
#datas = open("stack_example.xml","r")
#doc = BeautifulSoup(datas.read(), "lxml")
doc = BeautifulSoup(datas, "lxml")
doc.unit.get("surface")  # "cm2", the unit of the surface values

rows = []
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    # Attributes shared by every row under this <surfacedata>
    row = {'time-begin': surfacedata.get("time-begin"),
           'time-end': surfacedata.get("time-end")}

    things = surfacedata.find_all("thing", recursive=False)
    for thing in things:
        row['thing'] = thing.get("identity")

        locations = thing.find_all("location", recursive=False)
        for location in locations:
            row['location'] = location['l-identity']
            row['surface'] = location.text.strip()

            # Append a copy: row is reused and mutated on the next iteration
            rows.append(row.copy())

df = pd.DataFrame(rows)

Output:

print(df)
  time-begin time-end thing location surface
0          1        2     1        2   1.256
1          1        2     1       45     2.3
2          1        2     3        2     1.6
3          1        2     3        4     2.5
4          1        2     3       17     2.4
5          2        3     1       78     3.2
6          2        3     5        2     1.7
7          2        3     5        7     4.5

You can then set the index from the columns:

df.set_index(df.columns.to_list())
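Note that passing every column to set_index leaves no data columns behind. If you would rather keep surface as a value column, something along these lines may be closer to the layout in the question. This is only a sketch: the three sample rows, the key_cols name, and the numeric conversions are assumptions for illustration, not part of the code above.

```python
import pandas as pd

# A few rows in the same shape as the parsed output above (all strings).
rows = [
    {"time-begin": "1", "time-end": "2", "thing": "1", "location": "2", "surface": "1.256"},
    {"time-begin": "1", "time-end": "2", "thing": "1", "location": "45", "surface": "2.3"},
    {"time-begin": "2", "time-end": "3", "thing": "5", "location": "7", "surface": "4.5"},
]
df = pd.DataFrame(rows)

# Attribute values come out of the parser as strings; convert them first.
key_cols = ["time-begin", "time-end", "thing", "location"]  # hypothetical name
df[key_cols] = df[key_cols].astype(int)
df["surface"] = df["surface"].astype(float)

# Four index levels, with surface left as the only data column.
indexed = df.set_index(key_cols)

# A full key selects a single value.
value = indexed.loc[(1, 2, 1, 45), "surface"]
print(indexed)
```

set_index returns a new DataFrame, so assign the result if you want to keep it.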

A few issues to look at in your code:

  • for thing in thingss: contains a typo

  • Set temp["thing"]=[identity] rather than temp["location"]=[identity]

  • Swap these two assignments to get the correct column order

    temp["surface"]=[surface]
    temp["location"]=[l_identity]
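The behaviour described in the question (only the last location of each thing survives) can be reproduced without any XML. The sketch below uses made-up data and hypothetical names (records_wrong, records_right) purely to show why the position of the append matters.

```python
# Hypothetical stand-in for the parsed XML: thing identity -> location identities.
things = {"1": ["2", "45"], "3": ["2", "4", "17"]}

records_wrong, records_right = [], []
temp = {}
for thing, locations in things.items():
    temp["thing"] = thing
    for loc in locations:
        temp["location"] = loc            # overwritten on every iteration
        records_right.append(dict(temp))  # snapshot once per location
    records_wrong.append(dict(temp))      # snapshot once per thing: last location only

print(records_wrong)
print(len(records_right))
```

Appending inside the innermost loop gives one record per location, which is the shape the expected output needs.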
    

Example (note that l.append now sits inside the innermost loop, so every location produces a row)

import pandas as pd
from bs4 import BeautifulSoup

xml = '''
<file filename="stack_example" created="today">
    <unit time="day" volume="cm3" surface="cm2"/>
    <zone z_id="10">
        <surfacehistory type="calculation">
            <surfacedata time-begin="1" time-end="2">
                <thing identity="1">
                    <location l-identity="2"> 1.256</location>
                    <location l-identity="45"> 2.3</location>
                </thing>
                <thing identity="3">
                    <location l-identity="2"> 1.6</location>
                    <location l-identity="4"> 2.5</location>
                    <location l-identity="17"> 2.4</location>
                </thing>
            </surfacedata>
            <surfacedata time-begin="2" time-end="3">
                <thing identity="1">
                    <location l-identity="78"> 3.2</location>
                </thing>
                <thing identity="5">
                    <location l-identity="2"> 1.7</location>
                    <location l-identity="7"> 4.5</location>
                </thing>
            </surfacedata>
        </surfacehistory>
    </zone>
</file>
'''

doc = BeautifulSoup(xml, "lxml")


l = []
temp = {}

surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")

    temp["time_begin"]=[time_begin]
    temp["time_end"]=[time_end]

    things = surfacedata.find_all("thing", recursive=False)
    for thing in things:
        identity = thing.get("identity")
        temp["thing"]=[identity]
 
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            l_identity = location.get("l-identity")
            surface = location.getText()
            temp["location"]=[l_identity]
            temp["surface"]=[surface]
            l.append(pd.DataFrame(temp))

df = pd.concat(l, ignore_index=True).fillna(0.)
df.set_index(df.columns.to_list())

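Since you are using lxml, consider XSLT, the special-purpose language designed to transform XML files, together with the recent IO function pandas.read_xml, introduced in v1.3. Although read_xml works on flat XML files by default, its stylesheet argument lets you transform the original input into a flatter format before it is loaded into a DataFrame.

Specifically, the XSLT below descends to the <location> nodes and pulls the parent and ancestor attributes in as sibling elements, producing a flat structure of repeating nodes.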

XSLT (save as an .xsl file, which is a special kind of .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:output omit-xml-declaration="yes" indent="yes"/>
   <xsl:strip-space elements="*"/>
  
   <xsl:template match="/file">
     <data>
       <xsl:apply-templates select="descendant::location"/>
     </data>
   </xsl:template>

   <xsl:template match="location">
     <row>
       <time_begin><xsl:value-of select="ancestor::surfacedata/@time-begin"/></time_begin>
       <time_end><xsl:value-of select="ancestor::surfacedata/@time-end"/></time_end>
       <thing><xsl:value-of select="parent::thing/@identity"/></thing>
       <location><xsl:value-of select="@l-identity"/></location>
       <surface><xsl:value-of select="normalize-space()"/></surface>
     </row>
   </xsl:template>
  
</xsl:stylesheet>
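If you want to inspect the flattened intermediate XML before handing it to pandas, you can apply the stylesheet with lxml directly. This is a rough sketch, assuming lxml is installed; the input is a shortened version of the sample file, and the variable names (xml_text, xsl_text, flat_rows) are only illustrative.

```python
from lxml import etree

# Shortened version of the sample file from the question.
xml_text = """
<file filename="stack_example" created="today">
  <zone z_id="10">
    <surfacehistory type="calculation">
      <surfacedata time-begin="1" time-end="2">
        <thing identity="1">
          <location l-identity="2"> 1.256</location>
          <location l-identity="45"> 2.3</location>
        </thing>
      </surfacedata>
      <surfacedata time-begin="2" time-end="3">
        <thing identity="5">
          <location l-identity="7"> 4.5</location>
        </thing>
      </surfacedata>
    </surfacehistory>
  </zone>
</file>
""".strip()

# Same stylesheet as above, inlined as a string.
xsl_text = """
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/file">
    <data><xsl:apply-templates select="descendant::location"/></data>
  </xsl:template>
  <xsl:template match="location">
    <row>
      <time_begin><xsl:value-of select="ancestor::surfacedata/@time-begin"/></time_begin>
      <time_end><xsl:value-of select="ancestor::surfacedata/@time-end"/></time_end>
      <thing><xsl:value-of select="parent::thing/@identity"/></thing>
      <location><xsl:value-of select="@l-identity"/></location>
      <surface><xsl:value-of select="normalize-space()"/></surface>
    </row>
  </xsl:template>
</xsl:stylesheet>
""".strip()

# Compile the stylesheet and run it against the parsed document.
transform = etree.XSLT(etree.fromstring(xsl_text))
result = transform(etree.fromstring(xml_text))

# Each <row> holds one child element per future DataFrame column.
flat_rows = [{child.tag: child.text for child in row} for row in result.getroot()]

print(etree.tostring(result, pretty_print=True).decode())
```

Each <row> in the printed result corresponds to one line of the final DataFrame.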

Python (no packages needed beyond pandas and lxml, which the stylesheet argument requires)

import pandas as pd

xml_file = "input.xml"
xsl_file = "style.xsl"

surface_data_df = pd.read_xml(xml_file, stylesheet=xsl_file)

surface_data_df
#    time_begin  time_end  thing  location  surface
# 0           1         2      1         2    1.256
# 1           1         2      1        45    2.300
# 2           1         2      3         2    1.600
# 3           1         2      3         4    2.500
# 4           1         2      3        17    2.400
# 5           2         3      1        78    3.200
# 6           2         3      5         2    1.700
# 7           2         3      5         7    4.500