How to calculate mean values of the past n years grouped by two variables
First, I should say that I looked at [this answer][1], but I couldn't work out how to apply it to my case.
I have a dataset like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10010,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013,10013],
'Type': ['Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue','Red','Blue'],
'Year': [2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021,2018,2018,2019,2019,2020,2020,2021,2021,2022,2022,2018,2018,2019,2019,2021,2021],
'Score': [0,0,0,0,0,0,0,0,0,0,14,24,16,5,87,33,0,0,0,0,0,0,0,0,0,0,11,13,3,16,37,49]})
I didn't know how to enter NaN directly, so I wrote this:
df.replace(0, np.nan, inplace=True)
df = df.dropna(axis=0, subset=['Score'])
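For what it's worth, NaN can be typed directly as `np.nan` in the list, and the zero replacement can be limited to the `Score` column so that a genuine 0 in some other column is left alone. A minimal sketch on a shortened, made-up version of the data (the four rows below are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

# Shortened, made-up sample: np.nan is written directly in the Score list.
df = pd.DataFrame({
    'ID': [10010, 10010, 10013, 10013],
    'Type': ['Red', 'Blue', 'Red', 'Blue'],
    'Year': [2018, 2018, 2018, 2018],
    'Score': [np.nan, 24, 11, 0],
})

# Replace zeros only in Score, then drop rows with a missing Score.
df['Score'] = df['Score'].replace(0, np.nan)
df = df.dropna(subset=['Score'])
```

Only the two rows with a real score are left after this.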
I want the rolling N-year (here, 3-year) average score by ID and Type.
I can get the 3-year rolling average score by ID:
df['average_past_3_years'] = df.groupby(['ID'], as_index = False).rolling(3).agg(
{'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
'Year').transform('shift')['Score']
df = df.sort_values(['ID', 'Year'])
This gives me the rolling average, but only per ID and Year, not per Type:
ID Type Year Score average_past_3_years
10010 Red 2018 14.0 NaN
10010 Blue 2018 24.0 NaN
10010 Red 2019 16.0 NaN
10010 Blue 2019 5.0 18.000000
10010 Red 2021 87.0 NaN
10010 Blue 2021 33.0 36.000000
10013 Red 2018 11.0 NaN
10013 Blue 2018 13.0 NaN
10013 Red 2019 3.0 15.000000
10013 Blue 2019 16.0 9.000000
10013 Red 2021 37.0 41.666667
10013 Blue 2021 49.0 18.666667
The output I am after for 10013, Red, 2021 would be 17 rather than 41, because it should only count the Red scores.
I tried:
df['average_past_3_years'] = df.groupby([['ID', 'Type']], as_index = False).rolling(3).agg(
{'Score':'mean', 'Year': 'max'}).reset_index(level=0).groupby(
'Year').transform('shift')['Score']
But I get this error:
Grouper and axis must be same length
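As far as I can tell, the error comes from the doubled brackets: `[['ID', 'Type']]` is a list holding one list, so pandas appears to treat the inner list as a two-element array of group labels rather than as two column names, and its length does not match the number of rows. Passing a flat list of names avoids that. A small sketch on made-up data:

```python
import pandas as pd

# Made-up sample with two IDs and two types.
sample = pd.DataFrame({
    'ID': [10010, 10010, 10013, 10013],
    'Type': ['Red', 'Blue', 'Red', 'Blue'],
    'Score': [14, 24, 11, 13],
})

# Flat list of column names: one group per (ID, Type) pair.
# Writing [['ID', 'Type']] instead is what triggers the length error.
group_sizes = sample.groupby(['ID', 'Type']).size()
print(group_sizes)
```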
I am a bit stuck there.
I am also not sure whether I need to sort beforehand.
[1]:
IIUC, use:
df1 = df.groupby(['ID','Type']).rolling(3).agg( {'Score':'mean', 'Year': 'max'}).droplevel([0,1])
df = df.join(df1.add_suffix('_r'))
df['Score_r'] = df.groupby('Year')['Score_r'].shift()
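If the target is the 17 mentioned in the question for 10013 / Red / 2021, the rolling mean per (ID, Type) already has that value before the per-Year `shift`, which then moves values between rows that share a Year. The sketch below follows that reading and leaves the shift out. It is an assumption about the intent, not a replacement for the code above. It types out only the non-zero rows of the question's data and sorts first, because `rolling(3)` counts rows rather than calendar years (so the missing 2020 is not noticed by the window):

```python
import pandas as pd

# Non-zero rows of the question's data, typed out directly.
df = pd.DataFrame({
    'ID': [10010] * 6 + [10013] * 6,
    'Type': ['Red', 'Blue'] * 6,
    'Year': [2018, 2018, 2019, 2019, 2021, 2021] * 2,
    'Score': [14, 24, 16, 5, 87, 33, 11, 13, 3, 16, 37, 49],
})

# Rows must run from oldest to newest year within each group.
df = df.sort_values(['ID', 'Type', 'Year'])

# Mean of the current and two previous rows within each (ID, Type) group.
df['Score_r'] = (
    df.groupby(['ID', 'Type'])['Score']
    .transform(lambda s: s.rolling(3).mean())
)

# Look up the value the question asks about.
red_2021 = df.loc[
    (df['ID'] == 10013) & (df['Type'] == 'Red') & (df['Year'] == 2021),
    'Score_r',
].iloc[0]
print(red_2021)  # 17.0
```

`transform` keeps the original index, so the result can be assigned straight back as a column without `droplevel` or `join`.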