How to group by consecutive dates in Python?
I have this data. I want to find out which activity occurred on how many consecutive days:
Id datetime date Hour Activity
0 Abc 2021-04-26 14:30:33 2021-04-26 (12.0, 14.0] login
1 Abc 2021-04-26 12:55:27 2021-04-26 (12.0, 14.0] login
2 Abc 2021-04-26 13:30:31 2021-04-26 (12.0, 14.0] login
3 Abc 2021-04-28 11:55:33 2021-04-28 (10.0, 12.0] login
4 Abc 2021-05-01 08:25:15 2021-05-01 (8.0, 10.0] login
5 Abc 2021-05-01 09:45:01 2021-05-01 (8.0, 10.0] login
6 Abc 2021-05-02 11:05:19 2021-05-02 (10.0, 12.0] login
7 Abc 2021-05-03 02:26:12 2021-05-03 (2.0, 4.0] browsing
8 Abc 2021-05-03 03:59:10 2021-05-03 (2.0, 4.0] browsing
9 Abc 2021-05-03 05:40:00 2021-05-03 (4.0, 6.0] browsing
I tried to group all the consecutive dates:
sample['Consecutive'] = sample.groupby('Id').date.diff().dt.days.ne(1).cumsum()
This gives me the following output:
Id datetime date Hour Activity Consecutive
0 Abc 2021-04-26 14:30:33 2021-04-26 (12.0, 14.0] login 1
1 Abc 2021-04-26 12:55:27 2021-04-26 (12.0, 14.0] login 2
2 Abc 2021-04-26 13:30:31 2021-04-26 (12.0, 14.0] login 3
3 Abc 2021-04-28 11:55:33 2021-04-28 (10.0, 12.0] login 4
4 Abc 2021-05-01 08:25:15 2021-05-01 (8.0, 10.0] login 5
5 Abc 2021-05-01 09:45:01 2021-05-01 (8.0, 10.0] login 6
6 Abc 2021-05-02 11:05:19 2021-05-02 (10.0, 12.0] login 6
7 Abc 2021-05-03 02:26:12 2021-05-03 (2.0, 4.0] browsing 6
8 Abc 2021-05-03 03:59:10 2021-05-03 (2.0, 4.0] browsing 7
9 Abc 2021-05-03 05:40:00 2021-05-03 (4.0, 6.0] browsing 8
Expected output:
Id datetime date Hour Activity Consecutive
0 Abc 2021-04-26 14:30:33 2021-04-26 (12.0, 14.0] login 1
1 Abc 2021-04-26 12:55:27 2021-04-26 (12.0, 14.0] login 1
2 Abc 2021-04-26 13:30:31 2021-04-26 (12.0, 14.0] login 1
3 Abc 2021-04-28 11:55:33 2021-04-28 (10.0, 12.0] login 2
4 Abc 2021-05-01 08:25:15 2021-05-01 (8.0, 10.0] login 3
5 Abc 2021-05-01 09:45:01 2021-05-01 (8.0, 10.0] login 3
6 Abc 2021-05-02 11:05:19 2021-05-02 (10.0, 12.0] login 3
7 Abc 2021-05-03 02:26:12 2021-05-03 (2.0, 4.0] browsing 3
8 Abc 2021-05-03 03:59:10 2021-05-03 (2.0, 4.0] browsing 3
9 Abc 2021-05-03 05:40:00 2021-05-03 (4.0, 6.0] browsing 3
Please help me correct this.
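If I understand correctly what you are trying to achieve, you only need to change ne(1) to gt(1). With ne(1), rows that fall on the same day (a difference of 0) and the first row (a difference of NaN) are also counted as breaks, whereas gt(1) starts a new group only when the gap is larger than one day. Adding 1 makes the numbering start at 1: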
df['Consecutive'] = df.groupby('Id')['date'].diff().dt.days.gt(1).cumsum() + 1
df
Output:
Id datetime date Hour Activity Consecutive
0 Abc 2021-04-26 14:30:33 2021-04-26 (12.0, 14.0] login 1
1 Abc 2021-04-26 12:55:27 2021-04-26 (12.0, 14.0] login 1
2 Abc 2021-04-26 13:30:31 2021-04-26 (12.0, 14.0] login 1
3 Abc 2021-04-28 11:55:33 2021-04-28 (10.0, 12.0] login 2
4 Abc 2021-05-01 08:25:15 2021-05-01 (8.0, 10.0] login 3
5 Abc 2021-05-01 09:45:01 2021-05-01 (8.0, 10.0] login 3
6 Abc 2021-05-02 11:05:19 2021-05-02 (10.0, 12.0] login 3
7 Abc 2021-05-03 02:26:12 2021-05-03 (2.0, 4.0] browsing 3
8 Abc 2021-05-03 03:59:10 2021-05-03 (2.0, 4.0] browsing 3
9 Abc 2021-05-03 05:40:00 2021-05-03 (4.0, 6.0] browsing 3
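To make the difference between the two comparisons easier to see, here is a minimal, self-contained sketch that rebuilds only the Id, date and Activity columns from the question and computes both variants side by side. The column name with_ne and the variables gap_days and streak_days are illustrative names that do not appear in the original post, and the final step (counting distinct days per streak) is only one possible reading of "how many consecutive days". It also assumes the rows are already sorted by date within each Id.

```python
import pandas as pd

# Dates copied from the question; only the columns needed for grouping are kept.
df = pd.DataFrame({
    "Id": ["Abc"] * 10,
    "date": pd.to_datetime([
        "2021-04-26", "2021-04-26", "2021-04-26", "2021-04-28",
        "2021-05-01", "2021-05-01", "2021-05-02",
        "2021-05-03", "2021-05-03", "2021-05-03",
    ]),
    "Activity": ["login"] * 7 + ["browsing"] * 3,
})

# Gap in days to the previous row within each Id (NaN for the first row of an Id).
gap_days = df.groupby("Id")["date"].diff().dt.days

# ne(1) flags gaps of 0 (same day) and NaN as breaks, so almost every row
# starts a new group; gt(1) flags only gaps longer than one day.
df["with_ne"] = gap_days.ne(1).cumsum()
df["Consecutive"] = gap_days.gt(1).cumsum() + 1

# Illustrative follow-up: number of distinct calendar days in each streak.
streak_days = df.groupby(["Id", "Consecutive"])["date"].nunique()

print(df[["date", "with_ne", "Consecutive"]])
print(streak_days)
```

Note that cumsum() runs over the whole column rather than per Id, and the first row of each Id has a NaN gap that gt(1) treats as "no break". With several Ids, a new Id would therefore not necessarily get a new number, which is why the sketch groups by both Id and Consecutive when counting days.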