How do I export part of an XML file to a multi-level DataFrame in pandas?
I am trying to export part of an XML file to a multi-level DataFrame because I find it more convenient to work with. An example of the file is:
<file filename="stack_example" created="today">
  <unit time="day" volume="cm3" surface="cm2"/>
  <zone z_id="10">
    <surfacehistory type="calculation">
      <surfacedata time-begin="1" time-end="2">
        <thing identity="1">
          <location l-identity="2"> 1.256</location>
          <location l-identity="45"> 2.3</location>
        </thing>
        <thing identity="3">
          <location l-identity="2"> 1.6</location>
          <location l-identity="4"> 2.5</location>
          <location l-identity="17"> 2.4</location>
        </thing>
      </surfacedata>
      <surfacedata time-begin="2" time-end="3">
        <thing identity="1">
          <location l-identity="78"> 3.2</location>
        </thing>
        <thing identity="5">
          <location l-identity="2"> 1.7</location>
          <location l-identity="7"> 4.5</location>
        </thing>
      </surfacedata>
    </surfacehistory>
  </zone>
</file>
The ideal output for this example would be a pandas DataFrame similar to:
time-begin  time-end  thing  location  surface
1           2         1      2         1,256
                             45        2,3
                      3      2         1,6
                             4         2,5
                             17        2,4
2           3         1      78        3,2
                      5      2         1,7
                             7         4,5
Here is the code I have written so far:
import pandas as pd
from bs4 import BeautifulSoup
import lxml
datas = open("stack_example.xml","r")
doc = BeautifulSoup(datas.read(), "lxml")
doc.unit.get("surface")
l = []
temp={}
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")
    temp["time_begin"]=[time_begin]
    temp["time_end"]=[time_end]
    things = surfacedata.find_all("thing", recursive=False)
    for thing in thingss:
        identity = thing.get("identity")
        temp["thing"]=[identity]
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            l_identity = location.get("l-identity")
            surface = location.getText()
            temp["surface"]=[surface]
            temp["location"]=[l_identity]
    l.append(pd.DataFrame(temp))
res = pd.concat(l, ignore_index=True).fillna(0.)
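It only picks up the last location of each thing, because the location is overwritten inside the loop, but I am not sure how to get from here to the expected result.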
You were almost there. I only changed the logic slightly; the rest looks good.
datas = '''<file filename="stack_example" created="today">
<unit time="day" volume="cm3" surface="cm2"/>
<zone z_id="10">
<surfacehistory type="calculation">
<surfacedata time-begin="1" time-end="2">
<thing identity="1">
<location l-identity="2"> 1.256</location>
<location l-identity="45"> 2.3</location>
</thing>
<thing identity="3">
<location l-identity="2"> 1.6</location>
<location l-identity="4"> 2.5</location>
<location l-identity="17"> 2.4</location>
</thing>
</surfacedata>
<surfacedata time-begin="2" time-end="3">
<thing identity="1">
<location l-identity="78"> 3.2</location>
</thing>
<thing identity="5">
<location l-identity="2"> 1.7</location>
<location l-identity="7"> 4.5</location>
</thing>
</surfacedata>
</surfacehistory>
</zone>
</file>'''
Code:
import pandas as pd
from bs4 import BeautifulSoup
import lxml  # not called directly; it must be installed for the "lxml" parser

#datas = open("stack_example.xml","r")
#doc = BeautifulSoup(datas.read(), "lxml")
doc = BeautifulSoup(datas, "lxml")
doc.unit.get("surface")

rows = []
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")
    row = {'time-begin': time_begin,
           'time-end': time_end}
    things = surfacedata.find_all("thing", recursive=False)
    for thing in things:
        identity = thing.get("identity")
        row.update({'thing': identity})
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            locationStr = location['l-identity']
            surface = location.text.strip()
            row.update({'location': locationStr,
                        'surface': surface})
            # Store a copy so the next update does not overwrite this row
            row_copy = row.copy()
            rows.append(row_copy)

df = pd.DataFrame(rows)
Output:
print(df)
  time-begin time-end thing location surface
0          1        2     1        2   1.256
1          1        2     1       45     2.3
2          1        2     3        2     1.6
3          1        2     3        4     2.5
4          1        2     3       17     2.4
5          2        3     1       78     3.2
6          2        3     5        2     1.7
7          2        3     5        7     4.5
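The row.copy() call in the code above matters because the same dict is updated on every pass through the loop. The following is a minimal sketch of that effect using plain Python only; the names shared and copied are illustrative and do not come from the answer.

```python
# The same dict object is reused and updated on each iteration.
row = {"thing": "1"}
shared = []   # stores references to the one dict
copied = []   # stores an independent snapshot per iteration

for location in ["2", "45"]:
    row.update({"location": location})
    shared.append(row)          # every entry points at the same object
    copied.append(row.copy())   # each entry keeps the values at this moment

print(shared)
print(copied)
```

Without the copy, every stored entry reflects only the last update, which is the same symptom described in the question.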
You can set the index based on the columns:
df.set_index(df.columns.to_list())
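Note that passing every column to set_index moves all of them into the index and leaves no data columns. If you would rather keep surface as a value column, one possible variant is sketched below. It assumes the column order shown in the output above and uses a shortened, hand-typed copy of the data; key_columns and multi_df are illustrative names.

```python
import pandas as pd

# A shortened copy of the flat frame; values are strings, as the parser returns them.
df = pd.DataFrame(
    {
        "time-begin": ["1", "1", "1", "2"],
        "time-end": ["2", "2", "2", "3"],
        "thing": ["1", "1", "3", "1"],
        "location": ["2", "45", "2", "78"],
        "surface": ["1.256", "2.3", "1.6", "3.2"],
    }
)

# Use every column except the last one ("surface") as an index level.
key_columns = df.columns[:-1].to_list()
multi_df = df.set_index(key_columns)

# Convert the remaining value column from text to numbers.
multi_df["surface"] = multi_df["surface"].astype(float)
print(multi_df)
```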
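A few problems in your code:
for thing in thingss:
contains a typo; it should be for thing in things:.
Instead of temp["location"]=[identity], set temp["thing"]=[identity].
Swap these two assignments to get the correct column order:
temp["surface"]=[surface]
temp["location"]=[l_identity]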
Example:
import pandas as pd
from bs4 import BeautifulSoup
xml = '''
<file filename="stack_example" created="today">
<unit time="day" volume="cm3" surface="cm2"/>
<zone z_id="10">
<surfacehistory type="calculation">
<surfacedata time-begin="1" time-end="2">
<thing identity="1">
<location l-identity="2"> 1.256</location>
<location l-identity="45"> 2.3</location>
</thing>
<thing identity="3">
<location l-identity="2"> 1.6</location>
<location l-identity="4"> 2.5</location>
<location l-identity="17"> 2.4</location>
</thing>
</surfacedata>
<surfacedata time-begin="2" time-end="3">
<thing identity="1">
<location l-identity="78"> 3.2</location>
</thing>
<thing identity="5">
<location l-identity="2"> 1.7</location>
<location l-identity="7"> 4.5</location>
</thing>
</surfacedata>
</surfacehistory>
</zone>
</file>
'''
doc = BeautifulSoup(xml, "lxml")
l = []
temp = {}
surfacedatas = doc.surfacehistory.find_all("surfacedata")
for surfacedata in surfacedatas:
    time_begin = surfacedata.get("time-begin")
    time_end = surfacedata.get("time-end")
    temp["time_begin"]=[time_begin]
    temp["time_end"]=[time_end]
    things = surfacedata.find_all("thing", recursive=False)
    for thing in things:
        identity = thing.get("identity")
        temp["thing"]=[identity]
        locations = thing.find_all("location", recursive=False)
        for location in locations:
            l_identity = location.get("l-identity")
            surface = location.getText()
            temp["location"]=[l_identity]
            temp["surface"]=[surface]
            # Build one single-row frame per location, inside the innermost loop
            l.append(pd.DataFrame(temp))
df = pd.concat(l, ignore_index=True).fillna(0.)
df.set_index(df.columns.to_list())
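Since you are using lxml, consider XSLT, the special-purpose language designed to transform XML files, together with the recent IO module pandas.read_xml, introduced in v1.3. Although this method works on flat XML files by default, its stylesheet argument lets you transform the raw input into a flatter format for data frame migration.
Specifically, the XSLT below parses down to the <location> nodes and pulls the parent and ancestor attributes down as siblings, giving a flat structure with repeating nodes.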
XSLT (save as a .xsl file, a special .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/file">
    <data>
      <xsl:apply-templates select="descendant::location"/>
    </data>
  </xsl:template>

  <xsl:template match="location">
    <row>
      <time_begin><xsl:value-of select="ancestor::surfacedata/@time-begin"/></time_begin>
      <time_end><xsl:value-of select="ancestor::surfacedata/@time-end"/></time_end>
      <thing><xsl:value-of select="parent::thing/@identity"/></thing>
      <location><xsl:value-of select="@l-identity"/></location>
      <surface><xsl:value-of select="normalize-space()"/></surface>
    </row>
  </xsl:template>
</xsl:stylesheet>
Python (no packages needed beyond pandas and lxml)
import pandas as pd
xml_file = "input.xml"
xsl_file = "style.xsl"
surface_data_df = pd.read_xml(xml_file, stylesheet=xsl_file)
surface_data_df
#    time_begin  time_end  thing  location  surface
# 0           1         2      1         2    1.256
# 1           1         2      1        45    2.300
# 2           1         2      3         2    1.600
# 3           1         2      3         4    2.500
# 4           1         2      3        17    2.400
# 5           2         3      1        78    3.200
# 6           2         3      5         2    1.700
# 7           2         3      5         7    4.500
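If you would rather not maintain a stylesheet, the same flattening idea (walk down to each <location> and pull the parent and ancestor attributes alongside it) can be sketched in plain Python. This swaps XSLT for the standard-library xml.etree.ElementTree and is only an illustration under that assumption, not part of the answer above; xml_text and flat_df are illustrative names, and the sample document is inlined so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

import pandas as pd

xml_text = """<file filename="stack_example" created="today">
  <unit time="day" volume="cm3" surface="cm2"/>
  <zone z_id="10">
    <surfacehistory type="calculation">
      <surfacedata time-begin="1" time-end="2">
        <thing identity="1">
          <location l-identity="2"> 1.256</location>
          <location l-identity="45"> 2.3</location>
        </thing>
        <thing identity="3">
          <location l-identity="2"> 1.6</location>
          <location l-identity="4"> 2.5</location>
          <location l-identity="17"> 2.4</location>
        </thing>
      </surfacedata>
      <surfacedata time-begin="2" time-end="3">
        <thing identity="1">
          <location l-identity="78"> 3.2</location>
        </thing>
        <thing identity="5">
          <location l-identity="2"> 1.7</location>
          <location l-identity="7"> 4.5</location>
        </thing>
      </surfacedata>
    </surfacehistory>
  </zone>
</file>"""

root = ET.fromstring(xml_text)

rows = []
for surfacedata in root.iter("surfacedata"):
    for thing in surfacedata.findall("thing"):
        for location in thing.findall("location"):
            # One flat record per <location>, carrying its ancestors' attributes.
            rows.append(
                {
                    "time_begin": int(surfacedata.get("time-begin")),
                    "time_end": int(surfacedata.get("time-end")),
                    "thing": int(thing.get("identity")),
                    "location": int(location.get("l-identity")),
                    "surface": float(location.text),  # float() ignores the leading space
                }
            )

flat_df = pd.DataFrame(rows)
print(flat_df)
```

Converting to int and float while building the rows gives numeric columns directly, which is similar to what pandas.read_xml infers in the output shown above.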