How to organize data of several datasets into the same dataframe using pandas in Python?

Question

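I'm having trouble organizing some data in a DataFrame the way I want, using pandas in Python.

I want a single DataFrame in which the data is split into three columns (for example Time, V and I).

However, I want to keep the data from different samples in the same DataFrame, so that I can easily select the data from Sample#1 or Sample#2.

What I came up with is this: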

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Time': np.arange(0,10,0.5), 'V': np.random.rand(20), 'I': np.random.rand(20)})
df1['Sample'] = 'sample_1'

df2 = pd.DataFrame({'Time': np.arange(0,10,0.5), 'V': np.random.rand(20), 'I': np.random.rand(20)})
df2['Sample'] = 'sample_2'

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
df = pd.concat([df1, df2])

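Note that I added another column called Sample to keep track of which data corresponds to which sample.

However, I don't know how to call the data from sample_1 or sample_2 out of df.

How can I do that? Is this the right way to organize the data? Should I use a MultiIndex?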

Answer 1

Yes, a MultiIndex is one possible solution:

import numpy as np
import pandas as pd

np.random.seed(1)
df1 = pd.DataFrame({'Time': np.arange(0,10,0.5), 
                    'V': np.random.rand(20), 
                    'I': np.random.rand(20)})

np.random.seed(2)
df2 = pd.DataFrame({'Time': np.arange(0,10,0.5), 
                    'V': np.random.rand(20), 
                    'I': np.random.rand(20)})

#print (df1)
#print (df2)

You can concat all the DataFrames and label each source DataFrame through the keys parameter:

df = pd.concat([df1, df2], keys=('sample_1','sample_2'))
print (df)
                    I  Time         V
sample_1 0   0.800745   0.0  0.417022
         1   0.968262   0.5  0.720324
         2   0.313424   1.0  0.000114
         3   0.692323   1.5  0.302333
         4   0.876389   2.0  0.146756
         5   0.894607   2.5  0.092339
         6   0.085044   3.0  0.186260
         7   0.039055   3.5  0.345561
         8   0.169830   4.0  0.396767
         9   0.878143   4.5  0.538817
         10  0.098347   5.0  0.419195
         11  0.421108   5.5  0.685220
         12  0.957890   6.0  0.204452
         13  0.533165   6.5  0.878117
         14  0.691877   7.0  0.027388
         15  0.315516   7.5  0.670468
         16  0.686501   8.0  0.417305
         17  0.834626   8.5  0.558690
         18  0.018288   9.0  0.140387
         19  0.750144   9.5  0.198101
sample_2 0   0.505246   0.0  0.435995
         1   0.065287   0.5  0.025926
         2   0.428122   1.0  0.549662
         3   0.096531   1.5  0.435322
         4   0.127160   2.0  0.420368
         5   0.596745   2.5  0.330335
         6   0.226012   3.0  0.204649
         7   0.106946   3.5  0.619271
         8   0.220306   4.0  0.299655
         9   0.349826   4.5  0.266827
         10  0.467787   5.0  0.621134
         11  0.201743   5.5  0.529142
         12  0.640407   6.0  0.134580
         13  0.483070   6.5  0.513578
         14  0.505237   7.0  0.184440
         15  0.386893   7.5  0.785335
         16  0.793637   8.0  0.853975
         17  0.580004   8.5  0.494237
         18  0.162299   9.0  0.846561
         19  0.700752   9.5  0.079645
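As a minimal sketch of the same idea on smaller data: the frames below are hypothetical stand-ins with four rows each, and the level names 'Sample' and 'row' are my own choice, not part of the original answer. The names argument of pd.concat is optional; it only makes the two index levels easier to refer to later. A dict of frames can be passed instead of a list plus keys.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the same three columns as above
rng = np.random.default_rng(0)
time = np.arange(0, 2, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': rng.random(4), 'I': rng.random(4)})
df2 = pd.DataFrame({'Time': time, 'V': rng.random(4), 'I': rng.random(4)})

# keys labels the outer index level; names gives both levels a name
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2'],
               names=['Sample', 'row'])

# Passing a dict uses its keys as the outer level labels
df_from_dict = pd.concat({'sample_1': df1, 'sample_2': df2},
                         names=['Sample', 'row'])

print(df.index.names)
print(df.index.get_level_values('Sample').unique().tolist())
print(df_from_dict.equals(df))
```

Naming the levels means later calls can say level='Sample' instead of level=0, which tends to read better once there are more than two levels.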

The data can be selected with xs (see cross section):

print (df.xs('sample_1', level=0))
           I  Time         V
0   0.800745   0.0  0.417022
1   0.968262   0.5  0.720324
2   0.313424   1.0  0.000114
3   0.692323   1.5  0.302333
4   0.876389   2.0  0.146756
5   0.894607   2.5  0.092339
6   0.085044   3.0  0.186260
7   0.039055   3.5  0.345561
8   0.169830   4.0  0.396767
9   0.878143   4.5  0.538817
10  0.098347   5.0  0.419195
11  0.421108   5.5  0.685220
12  0.957890   6.0  0.204452
13  0.533165   6.5  0.878117
14  0.691877   7.0  0.027388
15  0.315516   7.5  0.670468
16  0.686501   8.0  0.417305
17  0.834626   8.5  0.558690
18  0.018288   9.0  0.140387
19  0.750144   9.5  0.198101
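For comparison, here is a small sketch (with made-up values, so the numbers differ from the output above) showing that plain .loc with the outer label should give the same frame as xs, and that xs can also cut across the inner level to pick one row position from every sample.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values so the result is easy to check by eye
time = np.arange(0, 2, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': [0.1, 0.2, 0.3, 0.4], 'I': [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({'Time': time, 'V': [0.5, 0.6, 0.7, 0.8], 'I': [5.0, 6.0, 7.0, 8.0]})
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2'])

by_xs = df.xs('sample_1', level=0)   # cross-section on the outer level
by_loc = df.loc['sample_1']          # label lookup on the outer level

# Cross-section on the inner level: row 0 of every sample
first_rows = df.xs(0, level=1)

print(by_xs.equals(by_loc))
print(first_rows)
```

xs is mostly useful when the level you want to cut on is not the outermost one; for the outer level, .loc is usually enough.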

If you need to select only some columns:

print (df.xs('sample_1', level=0)[['Time','I']])
    Time         I
0    0.0  0.800745
1    0.5  0.968262
2    1.0  0.313424
3    1.5  0.692323
4    2.0  0.876389
5    2.5  0.894607
6    3.0  0.085044
7    3.5  0.039055
8    4.0  0.169830
9    4.5  0.878143
10   5.0  0.098347
11   5.5  0.421108
12   6.0  0.957890
13   6.5  0.533165
14   7.0  0.691877
15   7.5  0.315516
16   8.0  0.686501
17   8.5  0.834626
18   9.0  0.018288
19   9.5  0.750144

Another solution is to use IndexSlice (see using slicers):

idx = pd.IndexSlice
print (df.loc[idx['sample_1',:], ['Time','I']])
             Time         I
sample_1 0    0.0  0.800745
         1    0.5  0.968262
         2    1.0  0.313424
         3    1.5  0.692323
         4    2.0  0.876389
         5    2.5  0.894607
         6    3.0  0.085044
         7    3.5  0.039055
         8    4.0  0.169830
         9    4.5  0.878143
         10   5.0  0.098347
         11   5.5  0.421108
         12   6.0  0.957890
         13   6.5  0.533165
         14   7.0  0.691877
         15   7.5  0.315516
         16   8.0  0.686501
         17   8.5  0.834626
         18   9.0  0.018288
         19   9.5  0.750144
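IndexSlice also accepts label ranges on the inner level, which is handy for picking a window of rows from one sample or the same row from all samples. The sketch below uses hypothetical four-row frames. Note that slicing a MultiIndex needs a lexsorted index; concat with sorted keys already yields one, so the sort_index call is only a safeguard.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values, same column layout as above
time = np.arange(0, 2, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': [0.1, 0.2, 0.3, 0.4], 'I': [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({'Time': time, 'V': [0.5, 0.6, 0.7, 0.8], 'I': [5.0, 6.0, 7.0, 8.0]})
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2']).sort_index()

idx = pd.IndexSlice

# Rows labelled 1 through 2 (label slices are inclusive) of sample_1
window = df.loc[idx['sample_1', 1:2], ['Time', 'I']]

# Row 0 of every sample, all columns
first_of_each = df.loc[idx[:, 0], :]

print(window)
print(first_of_each)
```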

If you need to remove the first level of the MultiIndex:

idx = pd.IndexSlice
print (df.loc[idx['sample_1',:], ['Time','I']].reset_index(level=0, drop=True))
    Time         I
0    0.0  0.800745
1    0.5  0.968262
2    1.0  0.313424
3    1.5  0.692323
4    2.0  0.876389
5    2.5  0.894607
6    3.0  0.085044
7    3.5  0.039055
8    4.0  0.169830
9    4.5  0.878143
10   5.0  0.098347
11   5.5  0.421108
12   6.0  0.957890
13   6.5  0.533165
14   7.0  0.691877
15   7.5  0.315516
16   8.0  0.686501
17   8.5  0.834626
18   9.0  0.018288
19   9.5  0.750144
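The layout from the question, with a plain Sample column, is also workable, and the two layouts convert into each other easily. As a sketch under assumptions (hypothetical four-row frames, and level names 'Sample' and 'row' chosen by me), reset_index moves the outer level into a regular column, after which a boolean mask or groupby selects per sample.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values, same column layout as above
time = np.arange(0, 2, 0.5)
df1 = pd.DataFrame({'Time': time, 'V': [0.1, 0.2, 0.3, 0.4], 'I': [1.0, 2.0, 3.0, 4.0]})
df2 = pd.DataFrame({'Time': time, 'V': [0.5, 0.6, 0.7, 0.8], 'I': [5.0, 6.0, 7.0, 8.0]})
df = pd.concat([df1, df2], keys=['sample_1', 'sample_2'],
               names=['Sample', 'row'])

# Turn the outer index level into an ordinary 'Sample' column
flat = df.reset_index(level='Sample')

# Boolean mask: keep only the rows that belong to sample_1
sample_1 = flat[flat['Sample'] == 'sample_1']

# Per-sample statistics work directly on the flat layout
means = flat.groupby('Sample')[['V', 'I']].mean()

print(sample_1)
print(means)
```

Which layout is better probably depends on the use: the flat column tends to suit groupby and plotting, while the MultiIndex keeps the sample label out of the data columns.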
