Count occurrences within a specific range
I have a dataframe that looks like this:
Tag
0 skip_1
1 run
2 skip_1
3 run
4 skip_1
5 run
6 skip_2
7 run
8 skip_1
9 run
10 skip_2
11 jump
12 skip_1
13 run
14 skip_2
15 jump
16 skip_1
17 run
18 skip_2
19 cleanup_jump
20 skip_1
21 run
22 skip_2
23 run
24 skip_2
25 jump
26 skip_1
27 run
28 skip_2
29 jump
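First, I want to count the number of run occurrences between two JUMP events, and then enumerate those occurrences within that range from most recent to oldest. The expected result is: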
Tag Jump_Run_Count Run_Order
0 skip_1 0 0
1 run 0 5
2 skip_1 0 0
3 run 0 4
4 skip_1 0 0
5 run 0 3
6 skip_2 0 0
7 run 0 2
8 skip_1 0 0
9 run 0 1
10 skip_2 0 0
11 jump 5 0
12 skip_1 0 0
13 run 0 1
14 skip_2 0 0
15 jump 1 0
16 skip_1 0 0
17 run 0 0
18 skip_2 0 0
19 cleanup_jump 0 0
20 skip_1 0 0
21 run 0 2
22 skip_2 0 0
23 run 0 1
24 skip_2 0 0
25 jump 2 0
26 skip_1 0 0
27 run 0 1
28 skip_2 0 0
29 jump 1 0
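One issue here is that the first group of run occurrences does not fall between two JUMP events, but between the start of the column and the first JUMP.
Second, I want to do the same counting and enumeration for the range between a JUMP and a CLEANUP_JUMP, and store the results in separate columns.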
Tag Jump_Run_Count Run_Order Cleanup_Jump_Dig_Count Run_Order2
0 skip_1 0 0 0 0
1 run 0 5 0 0
2 skip_1 0 0 0 0
3 run 0 4 0 0
4 skip_1 0 0 0 0
5 run 0 3 0 0
6 skip_2 0 0 0 0
7 run 0 2 0 0
8 skip_1 0 0 0 0
9 run 0 1 0 0
10 skip_2 0 0 0 0
11 jump 5 0 0 0
12 skip_1 0 0 0 0
13 run 0 1 0 0
14 skip_2 0 0 0 0
15 jump 1 0 0 0
16 skip_1 0 0 0 0
17 run 0 0 0 1
18 skip_2 0 0 0 0
19 cleanup_jump 0 0 1 0
20 skip_1 0 0 0 0
21 run 0 2 0 0
22 skip_2 0 0 0 0
23 run 0 1 0 0
24 skip_2 0 0 0 0
25 jump 2 0 0 0
26 skip_1 0 0 0 0
27 run 0 1 0 0
28 skip_2 0 0 0 0
29 jump 1 0 0 0
I've added some pictures that may explain it better:
Scenario1
Scenario2
Any help on how to code this, or even a different approach to the problem, would be much appreciated.
Thanks!
Here is a solution using pandas:
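import numpy as np
import pandas as pd

# Rebuild the example frame from the question (a default RangeIndex is assumed below).
tags = (['skip_1', 'run'] * 3 + ['skip_2', 'run', 'skip_1', 'run', 'skip_2', 'jump']
        + ['skip_1', 'run', 'skip_2', 'jump', 'skip_1', 'run', 'skip_2', 'cleanup_jump']
        + ['skip_1', 'run', 'skip_2', 'run', 'skip_2', 'jump', 'skip_1', 'run', 'skip_2', 'jump'])
df = pd.DataFrame({'Tag': tags})

df['run'] = df['Tag'] == 'run'
# Treat both marker types as range boundaries; each marker row starts a new group.
val_mask = df['Tag'].replace({'cleanup_jump': 'jump'}) == 'jump'
df['tag_id'] = val_mask.cumsum()
# Group i holds the runs that precede marker i+1; the last group follows the final marker, so drop it.
df.loc[val_mask, 'Jump_Count'] = df.groupby('tag_id')['run'].sum().to_numpy()[:-1]
# Number the runs within each group, then rank them so the most recent run gets 1.
df.loc[df['run'], 'run_per_jump'] = df.loc[df['run']].groupby('tag_id')['run'].cumsum()
df['Jump_Run_Order'] = df.groupby('tag_id')['run_per_jump'].rank(method='dense', ascending=False)

# Move the values of every range that ends in a cleanup_jump into separate columns.
jumps_idx = np.flatnonzero(df['Tag'] == 'jump')
cj_idxs = np.flatnonzero(df['Tag'] == 'cleanup_jump')
cj_help_idxs = np.asarray([np.max(jumps_idx[jumps_idx < cj_idx]) for cj_idx in cj_idxs])
for start, end in zip(cj_help_idxs + 1, cj_idxs):
    df.loc[start:end, 'Cleanup_Jump_Count'] = df.loc[start:end, 'Jump_Count']
    df.loc[start:end, 'Cleanup_Jump_Run_Order'] = df.loc[start:end, 'Jump_Run_Order']
    df.loc[start:end, 'Jump_Run_Order'] = 0
    df.loc[start:end, 'Jump_Count'] = 0

df = df.drop(columns=['tag_id', 'run', 'run_per_jump']).fillna(0).convert_dtypes(convert_integer=True)
print(df)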
Tag Jump_Count Jump_Run_Order Cleanup_Jump_Run_Order Cleanup_Jump_Count
0 skip_1 0 0 0 0
1 run 0 5 0 0
2 skip_1 0 0 0 0
3 run 0 4 0 0
4 skip_1 0 0 0 0
5 run 0 3 0 0
6 skip_2 0 0 0 0
7 run 0 2 0 0
8 skip_1 0 0 0 0
9 run 0 1 0 0
10 skip_2 0 0 0 0
11 jump 5 0 0 0
12 skip_1 0 0 0 0
13 run 0 1 0 0
14 skip_2 0 0 0 0
15 jump 1 0 0 0
16 skip_1 0 0 0 0
17 run 0 0 1 0
18 skip_2 0 0 0 0
19 cleanup_jump 0 0 0 1
20 skip_1 0 0 0 0
21 run 0 2 0 0
22 skip_2 0 0 0 0
23 run 0 1 0 0
24 skip_2 0 0 0 0
25 jump 2 0 0 0
26 skip_1 0 0 0 0
27 run 0 1 0 0
28 skip_2 0 0 0 0
29 jump 1 0 0 0
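For comparison, the same idea can probably be written without the explicit loop. The following is only a minimal sketch, not part of the solution above: the helper name `annotate_runs` and its parameters are hypothetical, and it assumes that every range is closed by the marker that follows it, so runs after the last marker simply get 0.

```python
import pandas as pd


def annotate_runs(df, tag_col="Tag", run="run", markers=("jump", "cleanup_jump")):
    """Hypothetical helper: per-marker run counts and most-recent-first run order."""
    tags = df[tag_col]
    is_run = tags.eq(run)
    is_end = tags.isin(markers)

    # Segment id: a segment starts right after a marker and ends at the next marker (inclusive).
    seg = is_end.shift(fill_value=False).cumsum()

    runs = is_run.astype(int)
    total = runs.groupby(seg).transform("sum")   # runs in the whole segment
    seen = runs.groupby(seg).cumsum()            # runs seen so far in the segment
    order = (total - seen + 1).where(is_run, 0)  # 1 = most recent run before the marker

    # Marker that closes each row's segment (NaN if no marker follows).
    closer = tags.where(is_end).bfill()

    out = df.copy()
    for marker in markers:
        prefix = marker.title()  # "jump" -> "Jump", "cleanup_jump" -> "Cleanup_Jump"
        out[f"{prefix}_Count"] = total.where(tags.eq(marker), 0)
        out[f"{prefix}_Run_Order"] = order.where(closer.eq(marker), 0)
    return out


# Example frame from the question.
tags = (["skip_1", "run"] * 3 + ["skip_2", "run", "skip_1", "run", "skip_2", "jump"]
        + ["skip_1", "run", "skip_2", "jump", "skip_1", "run", "skip_2", "cleanup_jump"]
        + ["skip_1", "run", "skip_2", "run", "skip_2", "jump", "skip_1", "run", "skip_2", "jump"])
result = annotate_runs(pd.DataFrame({"Tag": tags}))
print(result)
```

Because the segment id is shifted by one row, each marker row sits at the end of its own segment, which also covers the leading runs between the start of the column and the first JUMP without special handling.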