How to move a Pandas multi-index DataFrame to an Xarray DataArray
I am importing a CSV file into a Pandas DataFrame. The CSV file looks like this:
Time, Status, Variable, freq_1, freq_2, freq_3, .....
1/1/2000, Hi, A, 0.1, 3.3, 8.1, ....
1/1/2000, Hi, B, 2.4, 1.2, 1.3, ....
1/1/2000, Lo, A, 4.5, 6.9, 6.4, ....
1/1/2000, Lo, B, 7.1, 8.8, 2.3, ....
2/1/2000, Hi, A, 0.1, 3.3, 8.1, ....
2/1/2000, Hi, B, 2.4, 1.2, 1.3, ....
2/1/2000, Lo, A, 4.5, 6.9, 6.4, ....
2/1/2000, Lo, B, 7.1, 8.8, 2.3, ....
....
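I read it into a DataFrame with a MultiIndex, using Time, Status, and Variable as the index levels.
I now want to import the DataFrame into Xarray using either Pandas to_xarray or Xarray from_dataframe. However, both methods seem to choke on the index and throw this error: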
TypeError: Could not convert tuple of form (dims, data[, attrs, encoding]): (0, DatetimeIndex(['2018-01-12 00:15:00', '2018-01-12 00:45:00',
'2018-01-12 01:15:00', '2018-01-12 01:45:00',
'2018-01-12 02:15:00', '2018-01-12 02:45:00',
'2018-01-12 03:15:00', '2018-01-12 03:45:00',
'2018-01-12 04:15:00', '2018-01-12 04:45:00',
...
'2019-11-01 16:15:00', '2019-11-01 17:15:00',
'2019-11-01 17:45:00', '2019-11-01 18:15:00',
'2019-11-01 18:45:00', '2019-11-01 19:15:00',
'2019-11-01 20:45:00', '2019-11-01 21:15:00',
'2019-11-01 21:45:00', '2019-11-01 22:15:00'],
dtype='datetime64[ns]', name=0, length=3100, freq=None)) to Variable.
I have also tried using the Xarray.DataArray constructor:
Mytime = PD.to_datetime(df[0],infer_datetime_format=True)
Myfreq = np.array([ 0,1,2,3...39])
OutDataArray = Xarray.DataArray(df[np.arange(3,43)], coords=[('time', Mytime ), ('freq', Myfreq ), ('Status', df[1]), ('variable', df[2])])
But this gives the error:
ValueError: coords is not dict-like, but it has 4 items, which does not match the 2 dimensions of the data
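So, how can a Pandas DataFrame be imported into Xarray when the DataFrame is two-dimensional, but one of those dimensions (the rows) actually consists of several dimensions specified by the DataFrame's MultiIndex?
As requested, here is a sample script that reproduces the problem. Note that you will need to set the file name for the CSV file of sample data that the script writes and then re-imports: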
import numpy as np
import pandas as PD
# create some data
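# '30min' is the current alias for 30-minute steps; the older '30T' is deprecated in recent pandas
dt = PD.date_range(start='01/01/2000 00:00:00', end='02/01/2000 00:00:00', freq='30min')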
ldt = len(dt)
# np.str was removed in NumPy 1.24, so build the label columns directly
vr1 = PD.Series(['apple'] * ldt)
vr2 = PD.Series(['orange'] * ldt)
vr3 = PD.Series(['peach'] * ldt)
# combine the data to mimic my file format
a = PD.Series([1,2,3,4], index=[7,2,8,9])
b = PD.Series([5,6,7,8], index=[7,2,8,9])
df1 = PD.DataFrame({'Time': dt,'Fruit':vr1, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df2 = PD.DataFrame({'Time': dt,'Fruit':vr2, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df3 = PD.DataFrame({'Time': dt,'Fruit':vr3, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df_unsorted = PD.concat([df1, df2, df3])
df = df_unsorted.sort_values(by=['Time', 'Fruit'])
# write the data to a csv file
filename = 'Give a file path/name here'
df.to_csv(filename, index=False)
#import the csv file and convert to an xarray
df2 = PD.read_csv(filename, sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
da = df2.to_xarray()
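Your error appears to come from the columns and index levels of your CSV file not being named in the resulting DataFrame. Replace the last two lines of the code example with: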
df2 = PD.read_csv(filename, sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
df2.columns = ['N1', 'N2', 'N3']
df2.index.names = ['time', 'fruit']
ds = df2.to_xarray()
This results in a successful conversion to an xarray Dataset:
print(ds)
<xarray.Dataset>
Dimensions:  (fruit: 3, time: 1489)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ... 2000-02-01
  * fruit    (fruit) object 'apple' 'orange' 'peach'
Data variables:
    N1       (time, fruit) float64 0.114 0.3726 0.5072 ... 0.2065 0.9082 0.7945
    N2       (time, fruit) float64 0.7534 0.1107 0.8866 ... 0.4509 0.5218 0.1472
    N3       (time, fruit) float64 0.156 0.6498 0.3521 ... 0.3742 0.5899 0.607
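For anyone who wants to try the naming fix without writing a file to disk, here is a minimal sketch. It assumes a small in-memory CSV (the `csv_text` string and its values are made up for illustration) and uses the conventional `pd` alias rather than `PD`. Reading with `skiprows=1, header=None` leaves every label as an integer; assigning string names before calling `to_xarray()` is the step described above.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the CSV file (values are made up)
csv_text = """Time,Fruit,N1,N2,N3
2000-01-01 00:00:00,apple,0.1,0.2,0.3
2000-01-01 00:00:00,orange,0.4,0.5,0.6
2000-01-01 00:30:00,apple,0.7,0.8,0.9
2000-01-01 00:30:00,orange,1.0,1.1,1.2
"""

# Skip the header row, as in the question: all labels are now integers
df = pd.read_csv(io.StringIO(csv_text), skiprows=1, header=None)
df[0] = pd.to_datetime(df[0])
df = df.set_index([0, 1])

# Give the index levels and the columns string names before converting
df.index.names = ['time', 'fruit']
df.columns = ['N1', 'N2', 'N3']

# Each index level becomes a dimension, each column a data variable
ds = df.to_xarray()
print(ds)
```

The index levels become the `time` and `fruit` dimensions, and each column becomes its own data variable.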
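Update: You can skip setting the column and index names by hand if you remove the skiprows=1 and header=None arguments from PD.read_csv(), so that the column names are taken from the CSV header. Your last two lines would then look like: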
df2 = PD.read_csv(filename, sep=',', skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
ds = df2.to_xarray()
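Since the question title asks for a DataArray rather than a Dataset, one possible follow-up is to stack the data variables along a new dimension with `Dataset.to_array`. The sketch below is only an illustration under assumptions: the data are made up, the index levels are named after the CSV header from the question (Time, Status, Variable), and the new dimension name `freq` is my own choice.

```python
import numpy as np
import pandas as pd

# Made-up data shaped like the CSV in the question: three index levels
# and one column per frequency
times = pd.to_datetime(['2000-01-01', '2000-01-02'])
index = pd.MultiIndex.from_product(
    [times, ['Hi', 'Lo'], ['A', 'B']],
    names=['Time', 'Status', 'Variable'],
)
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(8, 3),
    index=index,
    columns=['freq_1', 'freq_2', 'freq_3'],
)

# Each named index level becomes a dimension of the Dataset
ds = df.to_xarray()

# Stack the per-column data variables along a new 'freq' dimension
da = ds.to_array(dim='freq')

# Optional: move 'freq' to the end so the layout mirrors the CSV
da = da.transpose('Time', 'Status', 'Variable', 'freq')
print(da.dims, da.shape)
```

This gives a single four-dimensional DataArray in which the original column labels serve as the `freq` coordinate.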