How to resample a Pandas dataframe of mixed type?
I generate a mixed-type (float and string) Pandas DataFrame df3 using the following Python code:
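import numpy as np
import pandas as pd
# `dates` is not defined in the post; the outputs below suggest a quarter-end DatetimeIndex starting at 2014-03-31
df1 = pd.DataFrame(np.random.randn(dates.shape[0], 2), index=dates, columns=list('AB'))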
df1['C'] = 'A'
df1['D'] = 'Pickles'
df2 = pd.DataFrame(np.random.randn(dates.shape[0], 2),index=dates,columns=list('AB'))
df2['C'] = 'B'
df2['D'] = 'Ham'
df3 = pd.concat([df1, df2], axis=0)
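When I resample df3 to a higher frequency, the frame is not resampled to the higher rate: the how argument seems to be ignored and I just get missing values: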
df4 = df3.groupby(['C']).resample('M', how={'A': 'mean', 'B': 'mean', 'D': 'ffill'})
df4.head()
Result:
B A D
C
A 2014-03-31 -0.4640906 -0.2435414 Pickles
2014-04-30 NaN NaN NaN
2014-05-31 NaN NaN NaN
2014-06-30 -0.5626360 0.6679614 Pickles
2014-07-31 NaN NaN NaN
When I resample df3 to a lower frequency, I don't get any resampling at all:
df5 = df3.groupby(['C']).resample('A', how={'A': np.mean, 'B': np.mean, 'D': 'ffill'})
df5.head()
Result:
B A D
C
A 2014-03-31 NaN NaN Pickles
2014-06-30 NaN NaN Pickles
2014-09-30 NaN NaN Pickles
2014-12-31 -0.7429617 -0.1065645 Pickles
2015-03-31 NaN NaN Pickles
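I'm fairly sure this has something to do with the mixed types, because if I redo the annual down-sampling with only the numeric columns, everything works as expected: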
df5b = df3[['A', 'B', 'C']].groupby(['C']).resample('A', how={'A': np.mean, 'B': np.mean})
df5b.head()
Result:
B A
C
A 2014-12-31 -0.7429617 -0.1065645
2015-12-31 -0.6245030 -0.3101057
B 2014-12-31 0.4213621 -0.0708263
2015-12-31 -0.0607028 0.0110456
But even when I restrict the frame to numeric columns, resampling to a higher frequency still doesn't work as I expect:
df4b = df3[['A', 'B', 'C']].groupby(['C']).resample('M', how={'A': 'mean', 'B': 'mean'})
df4b.head()
Result:
B A
C
A 2014-03-31 -0.4640906 -0.2435414
2014-04-30 NaN NaN
2014-05-31 NaN NaN
2014-06-30 -0.5626360 0.6679614
2014-07-31 NaN NaN
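This leaves me with two questions:
- What is the correct way to resample a dataframe of mixed type?
- When resampling from a lower to a higher frequency, what is the correct way to resample so that new values are interpolated?
Even if you can't give a full answer to both parts, partial solutions or an answer to either question are welcome.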
When resampling from a lower to a higher frequency, I realized that I was specifying how when I wanted to specify fill_method. When I do that, things seem to work.
df4c = df3.groupby(['C']).resample('M', fill_method='ffill')
df4c.head()
A B D
C
A 2014-03-31 -0.2435414 -0.4640906 Pickles
2014-04-30 -0.2435414 -0.4640906 Pickles
2014-05-31 -0.2435414 -0.4640906 Pickles
2014-06-30 0.6679614 -0.5626360 Pickles
2014-07-31 0.6679614 -0.5626360 Pickles
You get a more limited set of interpolation choices, but it does handle the mixed types.
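In current pandas the fill_method keyword is gone, so the same idea is written by calling a fill method on the Resampler. The sketch below is only an illustration on made-up quarterly data: the frame, its values, and the helper upsample_mixed are hypothetical and not from the post. It uses the month-start alias "MS" because the month-end alias is spelled differently across pandas versions. The second half shows one possible way to get linear interpolation for the float column while still forward-filling the string column.

```python
import pandas as pd

# Hypothetical quarterly data (not the poster's): two groups, one float
# column and one string column, indexed by quarter-start dates.
dates = pd.date_range("2014-01-01", periods=3, freq="QS")
df = pd.concat([
    pd.DataFrame({"A": [1.0, 2.0, 3.0], "C": "A", "D": "Pickles"}, index=dates),
    pd.DataFrame({"A": [10.0, 20.0, 30.0], "C": "B", "D": "Ham"}, index=dates),
])

# Rough modern spelling of resample(..., fill_method='ffill'):
# resample per group, then forward-fill the newly created monthly rows.
filled = df.groupby("C")[["A", "D"]].resample("MS").ffill()


def upsample_mixed(group):
    """Insert the new monthly rows, then fill each column in its own way."""
    out = group.resample("MS").asfreq()   # new rows start out as missing
    out["A"] = out["A"].interpolate()     # linear interpolation for floats
    out["D"] = out["D"].ffill()           # forward-fill for strings
    return out


# Apply the helper group by group so values never leak between groups.
mixed = pd.concat(
    {key: upsample_mixed(group[["A", "D"]]) for key, group in df.groupby("C")},
    names=["C"],
)

print(filled)
print(mixed)
```

Both results are indexed by (C, date). Handling each group separately in upsample_mixed is a deliberate choice: interpolating over the concatenated frame would blend the last value of one group into the first value of the next.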
When resampling to a lower frequency without the how option (I believe it defaults to mean), down-sampling does work:
df5c =df3.groupby(['C']).resample('A')
df5c.head()
A B
C
A 2014-12-31 -0.1065645 -0.7429617
2015-12-31 -0.3101057 -0.6245030
B 2014-12-31 -0.0708263 0.4213621
2015-12-31 0.0110456 -0.0607028
So the problem appears to lie in passing a dictionary to the how option, or in one of the option choices, presumably ffill, but I'm not sure.
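For comparison, here is a rough sketch of the same down-sampling in current pandas, where resample() no longer aggregates by default and the aggregation has to be named explicitly. The data is made up for illustration (eight quarters per group), and restricting the selection to the numeric column is my assumption about the simplest way to keep the string column out of the mean. The year-start alias "YS" is used because the year-end alias has changed between pandas versions.

```python
import pandas as pd

# Hypothetical quarterly data covering 2014 and 2015 for two groups.
dates = pd.date_range("2014-01-01", periods=8, freq="QS")
df = pd.concat([
    pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                  "C": "A", "D": "Pickles"}, index=dates),
    pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
                  "C": "B", "D": "Ham"}, index=dates),
])

# Select the numeric column, resample each group to yearly bins,
# and state the aggregation explicitly instead of relying on a default.
annual = df.groupby("C")[["A"]].resample("YS").mean()
print(annual)
```

The result has one row per (group, year), which matches the shape of the df5c output above.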
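Use resample and agg
Since pandas-1.0.0, the how and fill_method keywords no longer exist. In addition, the resample method now returns a Resampler object.
The solution is to define aggregation rules that associate a function or function name with each column.
df.resample(period).agg(aggregation_rule)
More examples of aggregation rules can be found in the documentation.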
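A minimal sketch of that pattern on a tiny made-up frame, to show that a rule can mix function names and callables. The column names, values, and the "range" lambda are illustrative assumptions, and "MS" (month start) is used only so the example does not depend on how a given pandas version spells the month-end alias.

```python
import pandas as pd

# Made-up daily data spanning a month boundary (values are illustrative only).
idx = pd.date_range("2021-01-30", periods=4, freq="D")
df = pd.DataFrame(
    {
        "A": [1.0, 3.0, 5.0, 7.0],
        "B": [10.0, 40.0, 20.0, 25.0],
        "D": ["w", "x", "y", "z"],
    },
    index=idx,
)

aggregation_rule = {
    "A": "mean",                       # a function name
    "B": lambda s: s.max() - s.min(),  # an arbitrary callable, here the range
    "D": "last",                       # "first"/"last" also work for strings
}

# One output row per month, one aggregation per column.
monthly = df.resample("MS").agg(aggregation_rule)
print(monthly)
```

Each column is reduced independently, which is why a string column can sit next to float columns as long as its rule makes sense for strings.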
Working example
Prepare the test data:
import numpy as np
import pandas as pd
dates = pd.date_range("2021-02-09", "2021-04-09", freq="1D")
df1 = pd.DataFrame(np.random.randn(dates.shape[0],2), index=dates, columns=list('AB'))
df1['C'] = 'A'
df1['D'] = 'Pickles'
df2 = pd.DataFrame(np.random.randn(dates.shape[0], 2), index=dates, columns=list('AB'))
df2['C'] = 'B'
df2['D'] = 'Ham'
df3 = pd.concat([df1, df2], axis=0)
print(df3)
Output:
A B C D
2021-02-09 2.591285 2.455686 A Pickles
2021-02-10 0.753461 -0.072643 A Pickles
2021-02-11 -0.351667 -0.025511 A Pickles
2021-02-12 -0.896730 0.004512 A Pickles
2021-02-13 -0.493139 -0.770514 A Pickles
... ... ... .. ...
2021-04-05 1.615935 1.152517 B Ham
2021-04-06 -0.067654 -0.858186 B Ham
2021-04-07 0.085587 -0.848542 B Ham
2021-04-08 -0.371983 0.088441 B Ham
2021-04-09 0.681501 0.235328 B Ham
[120 rows x 4 columns]
Monthly resampling:
agg_rules = {"A": "mean", "B": "sum", "C": "first", "D": "last"}
# "M" is the month-end alias; pandas 2.2 and later prefer "ME"
df4 = df3.resample("M").agg(agg_rules)
print(df4)
Output:
A B C D
2021-02-28 0.025987 3.886781 A Ham
2021-03-31 0.081423 -5.492928 A Ham
2021-04-30 0.239309 -3.344334 A Ham
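Note that the monthly example above resamples df3 without grouping by C, so each monthly row pools the rows of both groups. If per-group results are wanted, as in the question, one option is to group by C together with pd.Grouper(freq=...). This is an alternative technique, not the one the answer uses, and the small frame below is made up so the numbers can be checked by hand.

```python
import pandas as pd

# Hypothetical daily data for two groups sharing the same dates.
dates = pd.date_range("2021-01-30", periods=4, freq="D")
df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": 1.0,
                    "C": "A", "D": "Pickles"}, index=dates)
df2 = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0], "B": 1.0,
                    "C": "B", "D": "Ham"}, index=dates)
df3 = pd.concat([df1, df2], axis=0)

agg_rules = {"A": "mean", "B": "sum", "D": "last"}

# Without grouping: both groups fall into the same monthly bins.
pooled = df3.resample("MS").agg(agg_rules)

# With grouping: one monthly series per value of C.
per_group = df3.groupby(["C", pd.Grouper(freq="MS")]).agg(agg_rules)

print(pooled)
print(per_group)
```

In pooled, the January mean of A averages the values of both groups, while per_group keeps them apart under a (C, month) index.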