Pandas Grouper by frequency with completeness requirement
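I have monthly time series data that, for various reasons, is both missing some entries and has NaN values scattered throughout. I need to aggregate the data into quarterly and annual series, but I do not want to report figures for quarters or years with missing data. For example, in the data below, I do not want to report a value for Q1 2014 because January of that year is missing.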
import pandas as pd, numpy as np
df = pd.DataFrame([
('Monthly','2014-02-1', 529.1),
('Monthly','2014-03-1', 67.1),
('Monthly','2014-04-1', np.nan),
('Monthly','2014-05-1', 146.8),
('Monthly','2014-06-1', 469.7),
('Monthly','2014-07-1', 82.9),
('Monthly','2014-08-1', 636.9),
('Monthly','2014-09-1', 520.9),
('Monthly','2014-10-1', 217.4),
('Monthly','2014-11-1', 776.6),
('Monthly','2014-12-1', 18.4),
('Monthly','2015-01-1', 376.7),
('Monthly','2015-02-1', 266.5),
('Monthly','2015-03-1', np.nan),
('Monthly','2015-04-1', 144.1),
('Monthly','2015-05-1', 385.0),
('Monthly','2015-06-1', 527.1),
('Monthly','2015-07-1', 748.5),
('Monthly','2015-08-1', 518.2)],
columns=['Frequency','Date','Value'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(['Frequency','Date'],inplace=True)
df
Value
Frequency Date
2014-02-01 529.1
2014-03-01 67.1
2014-04-01 NaN
2014-05-01 146.8
2014-06-01 469.7
2014-07-01 82.9
2014-08-01 636.9
2014-09-01 520.9
2014-10-01 217.4
2014-11-01 776.6
2014-12-01 18.4
2015-01-01 376.7
2015-02-01 266.5
2015-03-01 NaN
2015-04-01 144.1
2015-05-01 385.0
2015-06-01 527.1
2015-07-01 748.5
2015-08-01 518.2
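I have tried using the Grouper function, but groupby skips NaN values when summing, and as far as I can tell the Grouper utility does not enforce time series completeness: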
df.groupby(pd.Grouper(level='Date', freq='Q')).sum()
Value
Date
2014-03-31 596.2
2014-06-30 616.5
2014-09-30 1240.7
2014-12-31 1012.4
2015-03-31 643.2
2015-06-30 1056.2
2015-09-30 1266.7
What I would like to see is:
Value
Date
2014-03-31 NaN # Because of missing 2014-01-01
2014-06-30 NaN # Because of NaN in 2014-04-01
2014-09-30 1240.7
2014-12-31 1012.4
2015-03-31 NaN # Because of NaN in 2015-03-01
2015-06-30 1056.2
2015-09-30 NaN # Because of missing 2015-09-01
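What is a good way to do this?
You can create a boolean mask that is True for each group that does not have exactly 3 non-NaN values (count() excludes NaN, so it catches both missing months and NaN entries):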
mask = (df.groupby(pd.Grouper(level='Date', freq='Q'))['Value'].count() != 3).values
Then simply set the corresponding rows to NaN:
grouped = df.groupby(pd.Grouper(level='Date', freq='Q'))
result = grouped.sum()
mask = (grouped['Value'].count() != 3).values
result.loc[mask, 'Value'] = np.nan
yields
Value
Date
2014-03-31 NaN
2014-06-30 NaN
2014-09-30 1240.7
2014-12-31 1012.4
2015-03-31 NaN
2015-06-30 1056.2
2015-09-30 NaN
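The same count-based rule can also be expressed through the `min_count` argument of the groupby `sum`, which returns NaN for any group with fewer than that many non-NaN values. The sketch below is one possible way to cover both the quarterly and the annual case. The helper name `complete_sum` and its `expected` parameter are illustrative, not part of pandas, and the approach assumes at most one row per month. Offset objects are passed instead of the `'Q'` string only because the spelling of that alias differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question (Feb 2014 to Aug 2015, two NaNs).
dates = pd.date_range('2014-02-01', '2015-08-01', freq='MS')
values = [529.1, 67.1, np.nan, 146.8, 469.7, 82.9, 636.9, 520.9, 217.4,
          776.6, 18.4, 376.7, 266.5, np.nan, 144.1, 385.0, 527.1, 748.5, 518.2]
index = pd.MultiIndex.from_arrays([['Monthly'] * len(dates), dates],
                                  names=['Frequency', 'Date'])
df = pd.DataFrame({'Value': values}, index=index)


def complete_sum(frame, freq, expected):
    """Sum per period, giving NaN unless `expected` non-NaN values are present.

    Hypothetical helper: assumes the 'Date' level holds one row per month.
    """
    grouped = frame.groupby(pd.Grouper(level='Date', freq=freq))
    return grouped.sum(min_count=expected)


# Quarter ends in Mar/Jun/Sep/Dec, matching the 'Q' alias used above.
quarterly = complete_sum(df, pd.offsets.QuarterEnd(startingMonth=12), 3)
# Calendar years: all 12 months must be present and non-NaN.
yearly = complete_sum(df, pd.offsets.YearEnd(), 12)

print(quarterly)
print(yearly)
```

With this data both years come out as NaN, since 2014 lacks January and contains a NaN, and 2015 stops in August.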
You may want to write your own aggregation function: 1. if there is any NaN, return NaN; 2. if the period is too short, also return NaN; 3. otherwise, return the sum:
gpy = df.groupby(pd.Grouper(level='Date', freq='Q'))
print(gpy.agg(lambda x: np.nan if (np.isnan(x).any() or len(x) < 3) else x.sum()))
Value
Date
2014-03-31 NaN
2014-06-30 NaN
2014-09-30 1240.7
2014-12-31 1012.4
2015-03-31 NaN
2015-06-30 1056.2
2015-09-30 NaN
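For readability, the three rules can be spelled out in a named function rather than a lambda. The following is a minimal sketch under the same assumptions as above (one row per month). The factory `make_complete_sum` and its `expected` argument are hypothetical names introduced here, and an offset object stands in for the `'Q'` alias to stay independent of the pandas version.

```python
import numpy as np
import pandas as pd

# Sample data from the question (Feb 2014 to Aug 2015, two NaNs).
dates = pd.date_range('2014-02-01', '2015-08-01', freq='MS')
values = [529.1, 67.1, np.nan, 146.8, 469.7, 82.9, 636.9, 520.9, 217.4,
          776.6, 18.4, 376.7, 266.5, np.nan, 144.1, 385.0, 527.1, 748.5, 518.2]
index = pd.MultiIndex.from_arrays([['Monthly'] * len(dates), dates],
                                  names=['Frequency', 'Date'])
df = pd.DataFrame({'Value': values}, index=index)


def make_complete_sum(expected):
    """Build an aggregator that requires `expected` rows per period."""
    def complete_sum(x):
        if x.isna().any():       # rule 1: any NaN in the period
            return np.nan
        if len(x) < expected:    # rule 2: period has too few rows
            return np.nan
        return x.sum()           # rule 3: period is complete
    return complete_sum


grouped = df.groupby(pd.Grouper(level='Date',
                                freq=pd.offsets.QuarterEnd(startingMonth=12)))
result = grouped['Value'].agg(make_complete_sum(3))
print(result)
```

A Python-level aggregator like this is typically slower than the built-in `sum` on large data, but it keeps each rule explicit and easy to change.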