How do I filter the rows that come before the row containing a certain value, for each group in a DataFrame?

How do I get, for each client_id, only the rows from the first 'click' in the 'action_type' column onward? Toy data:

import pandas as pd

df = pd.DataFrame({
    'user_client_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'timestamp': ['2021-12-18 09:15:59', '2021-12-18 10:33:49', '2021-12-18 10:34:08',
                  '2021-12-18 10:34:09', '2021-12-18 10:57:02', '2021-12-18 10:57:33', '2021-12-18 10:58:01', '2021-12-18 10:58:02', '2021-12-18 10:58:17',
                  '2021-12-18 10:58:29', '2021-12-18 10:58:31', '2021-12-18 10:58:34', '2021-12-18 10:58:34', '2021-12-18 10:58:47', '2021-12-18 10:59:12',
                  '2021-12-18 10:59:28', '2021-12-18 10:59:35', '2021-12-18 10:59:38', '2021-12-18 11:05:13', '2021-12-18 11:05:58', '2021-12-18 11:06:08', '2021-12-18 11:06:10', '2021-12-18 11:06:12', '2021-12-18 11:07:42',
                  '2021-12-18 11:10:07', '2021-12-18 11:10:23', '2021-12-18 11:10:53', '2021-12-18 11:10:58', '2021-12-18 11:13:04', '2021-12-18 11:13:06',
                  '2021-12-18 14:56:32', '2021-12-18 17:16:40'],
    # Column name without the trailing space, so df['action_type'] works
    'action_type': ['to_cart', 'to_cart', 'to_cart', 'to_cart', 'click', 'to_cart', 'to_cart', 'increment', 'remove', 'to_cart', 'increment', 'click', 'to_cart', 'increment', 'to_cart', 'to_cart', 'remove', 'to_cart', 'increment', 'to_cart', 'to_cart', 'click', 'increment',
                    'to_cart', 'to_cart', 'to_cart', 'click', 'increment', 'to_cart', 'increment', 'to_cart', 'increment'],
})

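For the client with ID 1, everything that appears before the click at 2021-12-18 10:57:02 should be filtered out. For the client with ID 2, everything that appears before the click at 2021-12-18 11:06:10 should be filtered out.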

I tried the following, but it only works for client 1, not for client 2:

df.iloc[df.loc[df['action_type']=='click'].index[0]:,:]

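Any time you say "for each client", it is a good sign that you need groupby. As for filtering out the rows before the first click, you can compute the cumulative number of clicks and keep only the rows where that count is greater than 0: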

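def from_first_click(group):  # renamed so it does not shadow the built-in filter()
    # Running count of clicks within this client's rows
    click = group['action_type'].eq('click').cumsum()
    # Keep the first click and everything after it
    return group[click > 0]

# apply() adds the group key as an extra index level; drop it to restore the original index
df.groupby('user_client_id').apply(from_first_click).reset_index(level=0, drop=True)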
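
To see why the cumulative count works, here is a minimal sketch on a smaller, made-up frame. The `toy` data and the `clicks_so_far` column name are assumptions for illustration and are not part of the question. The count stays at 0 until the first click of each client and is positive from then on:

```python
import pandas as pd

# Hypothetical miniature of the question's data: two clients, a few actions each.
toy = pd.DataFrame({
    'user_client_id': [1, 1, 1, 1, 2, 2, 2],
    'action_type': ['to_cart', 'click', 'to_cart', 'click',
                    'to_cart', 'to_cart', 'click'],
})

# 1 where the action is a click, 0 otherwise.
is_click = toy['action_type'].eq('click').astype(int)

# Running number of clicks seen so far, restarted for every client.
toy['clicks_so_far'] = is_click.groupby(toy['user_client_id']).cumsum()

# Rows at or after each client's first click have a positive count.
kept = toy[toy['clicks_so_far'] > 0]

print(toy)
print(kept.index.tolist())
```

This variant avoids `apply` by grouping the click indicator directly, so the result keeps the original index without any `reset_index` step.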

Use a boolean mask:

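# group_keys=False keeps the original row index on the mask so it aligns with df
# (newer pandas versions otherwise prepend the group key to the index)
m = df.groupby('user_client_id', group_keys=False)['action_type'] \
      .apply(lambda x: x.eq('click').cumsum().astype(bool))

out = df[m]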

Output:

>>> out
    user_client_id            timestamp action_type
4                1  2021-12-18 10:57:02       click
5                1  2021-12-18 10:57:33     to_cart
6                1  2021-12-18 10:58:01     to_cart
7                1  2021-12-18 10:58:02   increment
8                1  2021-12-18 10:58:17      remove
9                1  2021-12-18 10:58:29     to_cart
10               1  2021-12-18 10:58:31   increment
11               1  2021-12-18 10:58:34       click
12               1  2021-12-18 10:58:34     to_cart
13               1  2021-12-18 10:58:47   increment
14               1  2021-12-18 10:59:12     to_cart
21               2  2021-12-18 11:06:10       click
22               2  2021-12-18 11:06:12   increment
23               2  2021-12-18 11:07:42     to_cart
24               2  2021-12-18 11:10:07     to_cart
25               2  2021-12-18 11:10:23     to_cart
26               2  2021-12-18 11:10:53       click
27               2  2021-12-18 11:10:58   increment
28               2  2021-12-18 11:13:04     to_cart
29               2  2021-12-18 11:13:06   increment
30               2  2021-12-18 14:56:32     to_cart
31               2  2021-12-18 17:16:40   increment

The boolean mask, shown next to the data:

>>> pd.concat([df, m], axis=1)
    user_client_id            timestamp  action_type  action_type
0                1  2021-12-18 09:15:59      to_cart        False
1                1  2021-12-18 10:33:49      to_cart        False
2                1  2021-12-18 10:34:08      to_cart        False
3                1  2021-12-18 10:34:09      to_cart        False
4                1  2021-12-18 10:57:02        click         True
5                1  2021-12-18 10:57:33      to_cart         True
6                1  2021-12-18 10:58:01      to_cart         True
7                1  2021-12-18 10:58:02    increment         True
8                1  2021-12-18 10:58:17       remove         True
9                1  2021-12-18 10:58:29      to_cart         True
10               1  2021-12-18 10:58:31    increment         True
11               1  2021-12-18 10:58:34        click         True
12               1  2021-12-18 10:58:34      to_cart         True
13               1  2021-12-18 10:58:47    increment         True
14               1  2021-12-18 10:59:12      to_cart         True
15               2  2021-12-18 10:59:28      to_cart        False
16               2  2021-12-18 10:59:35       remove        False
17               2  2021-12-18 10:59:38      to_cart        False
18               2  2021-12-18 11:05:13    increment        False
19               2  2021-12-18 11:05:58      to_cart        False
20               2  2021-12-18 11:06:08      to_cart        False
21               2  2021-12-18 11:06:10        click         True
22               2  2021-12-18 11:06:12    increment         True
23               2  2021-12-18 11:07:42      to_cart         True
24               2  2021-12-18 11:10:07      to_cart         True
25               2  2021-12-18 11:10:23      to_cart         True
26               2  2021-12-18 11:10:53        click         True
27               2  2021-12-18 11:10:58    increment         True
28               2  2021-12-18 11:13:04      to_cart         True
29               2  2021-12-18 11:13:06    increment         True
30               2  2021-12-18 14:56:32      to_cart         True
31               2  2021-12-18 17:16:40    increment         True

You can use a mask built with groupby.cummax. It sets every value in each group to True from the first 'click' onward:

m = (df['action_type'].eq('click')
       .groupby(df['user_client_id'])
       .cummax()
     )

df[m]

Output:

    user_client_id            timestamp action_type
4                1  2021-12-18 10:57:02       click
5                1  2021-12-18 10:57:33     to_cart
6                1  2021-12-18 10:58:01     to_cart
7                1  2021-12-18 10:58:02   increment
8                1  2021-12-18 10:58:17      remove
9                1  2021-12-18 10:58:29     to_cart
10               1  2021-12-18 10:58:31   increment
11               1  2021-12-18 10:58:34       click
12               1  2021-12-18 10:58:34     to_cart
13               1  2021-12-18 10:58:47   increment
14               1  2021-12-18 10:59:12     to_cart
21               2  2021-12-18 11:06:10       click
22               2  2021-12-18 11:06:12   increment
23               2  2021-12-18 11:07:42     to_cart
24               2  2021-12-18 11:10:07     to_cart
25               2  2021-12-18 11:10:23     to_cart
26               2  2021-12-18 11:10:53       click
27               2  2021-12-18 11:10:58   increment
28               2  2021-12-18 11:13:04     to_cart
29               2  2021-12-18 11:13:06   increment
30               2  2021-12-18 14:56:32     to_cart
31               2  2021-12-18 17:16:40   increment
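
The same kind of mask can also be negated with `~` to select the rows before the first click, which is what the title literally asks for. Below is a minimal sketch on made-up data; the `toy` frame and client 3 (a hypothetical client that never clicks) are assumptions for illustration. The explicit `astype(bool)` is only a precaution so that `~` acts as a logical negation:

```python
import pandas as pd

# Hypothetical data: client 3 never clicks.
toy = pd.DataFrame({
    'user_client_id': [1, 1, 1, 2, 2, 3, 3],
    'action_type': ['to_cart', 'click', 'to_cart',
                    'click', 'increment',
                    'to_cart', 'remove'],
})

# True from each client's first click onward, False before it.
m = (toy['action_type'].eq('click')
       .groupby(toy['user_client_id'])
       .cummax()
       .astype(bool))

after = toy[m]     # the first click and everything after it
before = toy[~m]   # everything strictly before the first click

print(after.index.tolist())
print(before.index.tolist())
```

A client without any click contributes no rows to `after` and all of its rows to `before`, so it may be worth deciding up front how such clients should be treated.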