How to calculate mean values of the past n years grouped by two variables

Question

First off, I did look at [this answer][1], but I couldn't get any further with the information there.

So I have a dataset like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013],
'Type': ['Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue'],
'Year': [2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021,2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021],
'Score': [0,0,0,0,0,0,0,0,0,0,14,24,16,5,87,33,0,0,0,0,0,0,0,0,0,0,11,13,3,16,37,49]})

I didn't know how to enter NaN directly, so I wrote zeros and converted them like this:

df.replace(0, np.nan, inplace=True)
df = df.dropna(axis=0, subset=['Score'])

I want to get the rolling N-year (3 in this case) average score by ID and Type.

I can get the 3-year rolling average score by ID:

df['average_past_3_years'] = df.groupby(['ID'], as_index = False).rolling(3).agg(
                      {'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
                      'Year').transform('shift')['Score']
df = df.sort_values(['ID', 'Year'])

This gives me a rolling average, but only by ID and Year, not by Type:

ID     Type   Year  Score   average_past_3_years
10010   Red   2018  14.0    NaN
10010   Blue  2018  24.0    NaN
10010   Red   2019  16.0    NaN
10010   Blue  2019  5.0     18.000000
10010   Red   2021  87.0    NaN
10010   Blue  2021  33.0    36.000000
10013   Red   2018  11.0    NaN
10013   Blue  2018  13.0    NaN
10013   Red   2019  3.0     15.000000
10013   Blue  2019  16.0    9.000000
10013   Red   2021  37.0    41.666667
10013   Blue  2021  49.0    18.666667

The output I'm after: for ID 10013, Red, 2021 the value would be 17 rather than 41, because only the Red scores would be counted.

I tried:

df['average_past_3_years'] = df.groupby([['ID', 'Type']], as_index = False).rolling(3).agg(
                      {'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
                      'Year').transform('shift')['Score']

But I get this error:

Grouper and axis must be same length
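
A note on where this message most likely comes from, as a minimal sketch on a made-up four-row frame (`toy` and its values are hypothetical, not the data above). With the doubled brackets, pandas appears to treat the inner list `['ID', 'Type']` as a single array of group labels of length 2 rather than as two column names, and that length does not match the number of rows. A flat list of column names groups by both columns:

```python
import pandas as pd

# Hypothetical toy frame with four rows (not the question's data).
toy = pd.DataFrame({
    "ID": [1, 1, 1, 1],
    "Type": ["Red", "Blue", "Red", "Blue"],
    "Score": [1.0, 2.0, 3.0, 4.0],
})

# Nested list: the inner list is taken as one grouper of length 2,
# which cannot be aligned with the 4 rows of the frame.
try:
    toy.groupby([["ID", "Type"]])
    nested_error = None
except ValueError as exc:
    nested_error = str(exc)

# Flat list: two column names, so the frame is grouped by (ID, Type).
flat_means = toy.groupby(["ID", "Type"])["Score"].mean()

print(nested_error)
print(flat_means)
```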

I'm a bit stuck at this point.

I'm also not sure whether I need to sort the data beforehand.

  [1]:

Answer 1

IIUC, use:

df1 = df.groupby(['ID','Type']).rolling(3).agg( {'Score':'mean', 'Year': 'max'}).droplevel([0,1])

df = df.join(df1.add_suffix('_r'))
df['Score_r'] = df.groupby('Year')['Score_r'].shift()
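
For reference, here is a minimal sketch of a rolling mean per `ID` and `Type` that reproduces the value the question asks for (17 for ID 10013, Red, 2021). It is not the code above: it assumes the current year is meant to be part of the window, so the `shift` step is left out, and it uses `transform` with a lambda instead of `agg`. The frame holds only the non-zero rows from the question, and `N` and `target` are names introduced here for illustration.

```python
import pandas as pd

# Only the non-zero rows from the question, typed out directly.
df = pd.DataFrame({
    "ID":    [10010] * 6 + [10013] * 6,
    "Type":  ["Red", "Blue"] * 6,
    "Year":  [2018, 2018, 2019, 2019, 2021, 2021] * 2,
    "Score": [14.0, 24.0, 16.0, 5.0, 87.0, 33.0,
              11.0, 13.0, 3.0, 16.0, 37.0, 49.0],
})

N = 3  # window length in rows (one row per year within each group here)

# rolling() walks the rows in their current order, so sort by Year
# inside each (ID, Type) group before computing the window.
df = df.sort_values(["ID", "Type", "Year"])

# Rolling mean of Score within each (ID, Type) group; transform keeps
# the original index, so the result aligns with df row by row.
df["average_past_3_years"] = (
    df.groupby(["ID", "Type"])["Score"]
      .transform(lambda s: s.rolling(N).mean())
)

# The case named in the question: ID 10013, Red, 2021.
mask = (df["ID"] == 10013) & (df["Type"] == "Red") & (df["Year"] == 2021)
target = df.loc[mask, "average_past_3_years"].iloc[0]
print(target)  # mean of 11, 3 and 37
```

Note that `rolling(N)` counts rows, not calendar years. Since 2020 has no scores in this data, the 2021 window covers 2018, 2019 and 2021.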

python

group-by

pandas