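Including parent attributes when parsing xml using pandas read_xml
I am trying to read an xml file using pandas read_xml, but I am struggling to include parent attributes in the output. Is this possible with read_xml, or do I need to use a different parser? An example xml, what I have tried, and the desired output are below.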
import pandas as pd
xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
<LocationData LocationId="123" name="LocationName">
<ChannelData channelId="1" name="Temperature" >
<Values>
<VT t="2020-08-18T20:30:00">3.2</VT>
<VT t="2020-08-18T21:30:00">3.3</VT>
<VT t="2020-08-18T22:30:00">3.2</VT>
</Values>
</ChannelData>
<ChannelData channelId="2" name="Speed" >
<Values>
<VT t="2020-08-18T20:30:00">21.7</VT>
<VT t="2020-08-18T21:30:00">21.8</VT>
<VT t="2020-08-18T22:30:00">22.0</VT>
</Values>
</ChannelData>
</LocationData>
</LocationList>
'''
# Getting all VT values, but no parent attributes
pd.read_xml(xml, xpath='.//VT')
"""
                     t    VT
0  2020-08-18T20:30:00   3.2
1  2020-08-18T21:30:00   3.3
2  2020-08-18T22:30:00   3.2
3  2020-08-18T20:30:00  21.7
4  2020-08-18T21:30:00  21.8
5  2020-08-18T22:30:00  22.0
"""
# Alternative: read one channel at a time and build the dataframe in a loop,
# for example like this. But I want to avoid opening the file several times,
# since the files can be large.
pd.read_xml(xml, xpath='.//ChannelData[@channelId="1"]/Values/VT')
"""
                     t   VT
0  2020-08-18T20:30:00  3.2
1  2020-08-18T21:30:00  3.3
2  2020-08-18T22:30:00  3.2
"""
Example of the desired output (it could include all or selected parent attributes), ideally reading the xml file only once:
                     t    VT channelId LocationId
0  2020-08-18T20:30:00   3.2         1        123
1  2020-08-18T21:30:00   3.3         1        123
2  2020-08-18T22:30:00   3.2         1        123
3  2020-08-18T20:30:00  21.7         2        123
4  2020-08-18T21:30:00  21.8         2        123
5  2020-08-18T22:30:00  22.0         2        123
Using ElementTree - see below
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
<LocationData LocationId="123" name="LocationName">
<ChannelData channelId="1" name="Temperature" >
<Values>
<VT t="2020-08-18T20:30:00">3.2</VT>
<VT t="2020-08-18T21:30:00">3.3</VT>
<VT t="2020-08-18T22:30:00">3.2</VT>
</Values>
</ChannelData>
<ChannelData channelId="2" name="Speed" >
<Values>
<VT t="2020-08-18T20:30:00">21.7</VT>
<VT t="2020-08-18T21:30:00">21.8</VT>
<VT t="2020-08-18T22:30:00">22.0</VT>
</Values>
</ChannelData>
</LocationData>
</LocationList>
'''
data = []
root = ET.fromstring(xml)
# Walk down the hierarchy so every VT row can carry its parents' attributes
for ld in root.findall('.//LocationData'):
    location = ld.attrib['LocationId']
    for cd in ld.findall('ChannelData'):
        channel = cd.attrib['channelId']
        for vt in cd.findall('.//VT'):
            data.append({'t': vt.attrib['t'], 'VT': vt.text,
                         'channelId': channel, 'LocationId': location})
df = pd.DataFrame(data)
print(df)
Output
                     t    VT channelId LocationId
0  2020-08-18T20:30:00   3.2         1        123
1  2020-08-18T21:30:00   3.3         1        123
2  2020-08-18T22:30:00   3.2         1        123
3  2020-08-18T20:30:00  21.7         2        123
4  2020-08-18T21:30:00  21.8         2        123
5  2020-08-18T22:30:00  22.0         2        123
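Since the question mentions large files, the same idea can probably be done in a streaming fashion with ElementTree's iterparse, so the whole tree never has to sit in memory at once. The following is only a rough sketch under assumptions: the helper name iter_vt_rows and the shortened sample document are made up for illustration, and the element and attribute names are taken from the example xml above.

```python
import io
import xml.etree.ElementTree as ET

# Shortened sample with the same structure as the xml in the question (illustrative only)
sample = b'''<?xml version="1.0" encoding="ISO-8859-1" ?>
<LocationList>
<LocationData LocationId="123" name="LocationName">
<ChannelData channelId="1" name="Temperature">
<Values>
<VT t="2020-08-18T20:30:00">3.2</VT>
<VT t="2020-08-18T21:30:00">3.3</VT>
</Values>
</ChannelData>
<ChannelData channelId="2" name="Speed">
<Values>
<VT t="2020-08-18T20:30:00">21.7</VT>
<VT t="2020-08-18T21:30:00">21.8</VT>
</Values>
</ChannelData>
</LocationData>
</LocationList>
'''


def iter_vt_rows(source):
    """Yield one dict per VT element, tagged with the enclosing parent ids.

    Hypothetical helper: `source` is a filename or a binary file object.
    """
    location_id = None
    channel_id = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start tag is seen
            if elem.tag == "LocationData":
                location_id = elem.get("LocationId")
            elif elem.tag == "ChannelData":
                channel_id = elem.get("channelId")
        elif elem.tag == "VT":
            # On "end" the element text is complete
            yield {"t": elem.get("t"), "VT": elem.text,
                   "channelId": channel_id, "LocationId": location_id}
        elif elem.tag == "ChannelData":
            # Drop the finished channel's children to keep memory use low
            elem.clear()


rows = list(iter_vt_rows(io.BytesIO(sample)))
for row in rows:
    print(row)
```

The rows can then be passed to pd.DataFrame(rows) exactly as in the loop version; for a real file, pass the path instead of the BytesIO object.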
One option is to make the path generic and capture as much as possible, starting from the LocationData node:
(pd
 .read_xml(xml, xpath='.//*')
 .dropna(how='all', axis=1)
 .assign(channelId=lambda df: df.channelId.ffill(),
         LocationId=lambda df: df.LocationId.ffill(),
         name=lambda df: df.name.ffill())
 .dropna(subset='t')
)
    LocationId         name  channelId    VT                    t
3        123.0  Temperature        1.0   3.2  2020-08-18T20:30:00
4        123.0  Temperature        1.0   3.3  2020-08-18T21:30:00
5        123.0  Temperature        1.0   3.2  2020-08-18T22:30:00
8        123.0        Speed        2.0  21.7  2020-08-18T20:30:00
9        123.0        Speed        2.0  21.8  2020-08-18T21:30:00
10       123.0        Speed        2.0  22.0  2020-08-18T22:30:00
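Note that the ids come back as floats (123.0, 1.0), because the parent rows leave NaN in those columns until they are forward-filled. A small sketch of the mechanism follows, with a cast back to integers added at the end. The frame below is built by hand to mimic the shape that the generic xpath produces (one row per element); it is an assumption for illustration, not actual read_xml output.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in: LocationData row, ChannelData row, Values row, then two VT rows
raw = pd.DataFrame({
    "LocationId": [123.0, np.nan, np.nan, np.nan, np.nan],
    "channelId": [np.nan, 1.0, np.nan, np.nan, np.nan],
    "t": [None, None, None, "2020-08-18T20:30:00", "2020-08-18T21:30:00"],
    "VT": [np.nan, np.nan, np.nan, 3.2, 3.3],
})

tidy = (
    raw
    # Carry each parent's id down to the rows that follow it
    .assign(LocationId=lambda d: d["LocationId"].ffill(),
            channelId=lambda d: d["channelId"].ffill())
    # Keep only the VT rows, which are the ones that have a timestamp
    .dropna(subset=["t"])
    # No NaN is left in the id columns now, so they can be cast to integers
    .astype({"LocationId": "int64", "channelId": "int64"})
    .reset_index(drop=True)
)
print(tidy)
```

The forward fill relies on the rows being in document order, so it should be applied before any sorting or filtering.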