Group by two columns and cumulative sum with a lookback window of 6 months on date
Original dataset:
userId createDate grade
0 2016-05-08 22:00:49.673 2
0 2016-07-23 12:37:11.570 7
0 2017-01-03 12:05:33.060 7
1009 2016-06-27 09:28:19.677 5
1009 2016-07-23 12:37:11.570 8
1009 2017-01-03 12:05:33.060 9
1009 2017-02-08 16:17:17.547 4
2011 2016-11-03 14:30:25.390 6
2011 2016-12-15 21:06:14.730 11
2011 2017-01-04 20:22:31.423 2
2011 2017-08-08 16:17:17.547 7
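I want the sum of grades per user with a lookback window of 6 months from each createDate, i.e. for every row, the sum of all of that user's grades created less than 6 months before that row's createDate.
Expected: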
userId createDate
0 2016-05-08 22:00:49.673 2
2016-07-23 12:37:11.570 9
2017-01-03 12:05:33.060 14
1009 2016-06-27 09:28:19.677 5
2016-07-23 12:37:11.570 13
2017-01-03 12:05:33.060 17
2017-02-08 16:17:17.547 13
2011 2016-11-03 14:30:25.390 6
2016-12-15 21:06:14.730 17
2017-01-04 20:22:31.423 19
2017-08-08 16:17:17.547 7
My current attempt is wrong:
df.groupby(['userId','createDate'])['grade'].mean().groupby([pd.Grouper(level='userId'),pd.TimeGrouper('6M', level='createDate', closed = 'left')]).cumsum()
It gives me the following result:
userId createDate
0 2016-05-08 22:00:49.673 2
2016-07-23 12:37:11.570 9
2017-01-03 12:05:33.060 7
1009 2016-06-27 09:28:19.677 5
2016-07-23 12:37:11.570 13
2017-01-03 12:05:33.060 9
2017-02-08 16:17:17.547 13
2011 2016-11-03 14:30:25.390 6
2016-12-15 21:06:14.730 17
2017-01-04 20:22:31.423 19
2017-08-08 16:17:17.547 7
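A likely reason the attempt falls short is that a time-based Grouper cuts the dates into fixed, non-overlapping bins, so the cumulative sum restarts at each bin edge instead of sliding with each row. The sketch below illustrates this on user 0's grades; it assumes 180-day bins as a stand-in for '6M' and drops the time of day for brevity.

```python
import pandas as pd

# User 0's grades from the question; times of day dropped (an assumption for brevity).
s = pd.Series(
    [2, 7, 7],
    index=pd.to_datetime(["2016-05-08", "2016-07-23", "2017-01-03"]),
)

# Fixed, non-overlapping 180-day bins: the third row lands in a new bin,
# so its cumulative sum starts over.
binned = s.groupby(pd.Grouper(freq="180D")).cumsum()

# Sliding 180-day window that ends at each row.
rolled = s.rolling("180D").sum()

print(binned.tolist())  # binned cumulative sums
print(rolled.tolist())  # rolling sums
```

The binned version gives 7 for the last row, as in the result above, while the sliding window gives the expected 14.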
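Use groupby with apply and a rolling sum with an offset of 180D rather than 6 months. The number of days in a month varies from month to month, and a time-based rolling window must have a constant length, i.e.
df.groupby(['userId'])[['createDate','grade']].apply(lambda x : x.set_index('createDate').rolling('180D').sum())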
grade
userId createDate
0 2016-05-08 22:00:49.673 2.0
2016-07-23 12:37:11.570 9.0
2017-01-03 12:05:33.060 14.0
1009 2016-06-27 09:28:19.677 5.0
2016-07-23 12:37:11.570 13.0
2017-01-03 12:05:33.060 17.0
2017-02-08 16:17:17.547 13.0
2011 2016-11-03 14:30:25.390 6.0
2016-12-15 21:06:14.730 17.0
2017-01-04 20:22:31.423 19.0
2017-08-08 16:17:17.547 7.0
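To see why the window is given in days, here is a small sketch on user 0's grades (times of day dropped, which is an assumption for brevity). It tries a month-based window first, which pandas should reject because a month is not a fixed length, then falls back to 180D.

```python
import warnings

import pandas as pd

s = pd.Series(
    [2, 7, 7],
    index=pd.to_datetime(["2016-05-08", "2016-07-23", "2017-01-03"]),
)

# A month-based window is not a fixed length, so rolling() refuses it.
try:
    with warnings.catch_warnings():
        # Newer pandas versions may warn that the 'M' alias is deprecated.
        warnings.simplefilter("ignore")
        s.rolling("6M").sum()
    month_window_rejected = False
except ValueError as exc:
    month_window_rejected = True
    print("rolling('6M') was rejected:", exc)

# A day-based window has a constant length and is accepted.
fixed = s.rolling("180D").sum()
print(fixed)
```

Note that 180D is an approximation of 6 calendar months, so rows that fall within a few days of the boundary may be treated differently from a strict calendar-month lookback.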
Edit from the comments:
To look back 6 months, the dates need to be sorted within each user, so you may need sort_values:
df.groupby(['userId'])[['createDate','grade']].apply(lambda x : \
x.sort_values('createDate').set_index('createDate').rolling('180D').sum())
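The sorting requirement can be checked on a single user's rows. This is a minimal sketch with the rows deliberately shuffled (the shuffled order and the dropped times of day are assumptions for illustration): a time-based rolling window on an unsorted index should raise, and sorting first fixes it.

```python
import pandas as pd

# User 0's rows from the question, deliberately out of order.
shuffled = pd.DataFrame({
    "createDate": pd.to_datetime(["2017-01-03", "2016-05-08", "2016-07-23"]),
    "grade": [7, 2, 7],
})

# Rolling over a non-monotonic datetime index is rejected.
try:
    shuffled.set_index("createDate").rolling("180D").sum()
    unsorted_rejected = False
except ValueError as exc:
    unsorted_rejected = True
    print("unsorted index was rejected:", exc)

# Sorting by date first makes the index monotonic.
sorted_sum = (
    shuffled.sort_values("createDate")
            .set_index("createDate")
            .rolling("180D")
            .sum()
)
print(sorted_sum)
```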
Edit based on @coldspeed's comment:
Using apply is overkill here. Set the index first, then use the rolling sum directly on the groupby:
df.set_index('createDate').groupby('userId').grade.rolling('180D').sum()
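For a copy-paste check, here is a self-contained sketch that rebuilds the question's data and runs the apply-free version. The sort_values call up front is an addition not in the one-liner above; it is there so the dates are ordered even if the input is not.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "userId": [0, 0, 0, 1009, 1009, 1009, 1009, 2011, 2011, 2011, 2011],
    "createDate": pd.to_datetime([
        "2016-05-08 22:00:49.673", "2016-07-23 12:37:11.570",
        "2017-01-03 12:05:33.060", "2016-06-27 09:28:19.677",
        "2016-07-23 12:37:11.570", "2017-01-03 12:05:33.060",
        "2017-02-08 16:17:17.547", "2016-11-03 14:30:25.390",
        "2016-12-15 21:06:14.730", "2017-01-04 20:22:31.423",
        "2017-08-08 16:17:17.547",
    ]),
    "grade": [2, 7, 7, 5, 8, 9, 4, 6, 11, 2, 7],
})

result = (
    df.sort_values("createDate")    # dates must be ordered for a time window
      .set_index("createDate")      # rolling('180D') needs a datetime index
      .groupby("userId")["grade"]   # one independent window per user
      .rolling("180D")              # look back 180 days from each row
      .sum()
)
print(result)
```

The result is a Series indexed by (userId, createDate), which lines up with the expected table in the question.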
Timings:
df = pd.concat([df]*1000)
%%timeit
df.set_index('createDate').groupby('userId').grade.rolling('180D').sum()
100 loops, best of 3: 7.55 ms per loop
%%timeit
df.groupby(['userId'])[['createDate','grade']].apply(lambda x : x.sort_values('createDate').set_index('createDate').rolling('180D').sum())
10 loops, best of 3: 19.5 ms per loop