Remove rows in pandas dataframe after certain value (while for looping?)

Question

I have a dataframe containing the history of online store users. Example:

In [1]:   a = pd.DataFrame([[1, 'view', 'a'], [1, 'cart', 'b'], [2, 'cart','b'], [2, 'cart','c'], [2, 'view','d'], 
                 [2, 'purchase','d'], [2, 'view','e'], [2, 'cart','e']],
                columns=['user_session', 'event_type', 'product_id'])

In [2]: a

Out[2]: 
   user_session  event_type    product_id
0  1             view            a
1  1             cart            b
2  2             cart            b
3  2             cart            c
4  2             view            d
5  2             purchase        d
6  2             view            e
7  2             cart            e

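There can be several purchases within one user_session. I need to delete all remaining rows of a session immediately after its first purchase. I found a partial solution here: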

df.loc[:(df['event_type'] == 'purchase').idxmax()]

But I need to go through a huge dataset with millions of rows. Is a for loop a good idea here? There must be a better way.

Another approach might be to build a list of the indexes of the rows I want to drop, as described here: dropping a row while iterating through pandas dataframe


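indexes_to_drop = []  # index labels collected during the loop
for i in df.index:
    ....
    if {make your decision here}:
        indexes_to_drop.append(i)
    ....

# i is already an index label, so pass the labels to drop directly
df.drop(indexes_to_drop, inplace=True)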

Still, is there another way?

Thanks a lot!

Answer 1

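You can check the condition and then use cummax so that, within each group, the condition stays True from its first occurrence onward. Then we slice the DataFrame with the negated mask: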

mask = ~(a['event_type'].eq('purchase').groupby(a['user_session']).cummax())

a[mask]
#   user_session event_type product_id
#0             1       view          a
#1             1       cart          b
#2             2       cart          b
#3             2       cart          c
#4             2       view          d
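
To see why this works, the one-liner can be unpacked into its intermediate steps. This is only an illustrative sketch: the variable names (is_purchase, at_or_after, before_purchase) are not from the answer, and the astype(bool) call is a defensive assumption in case the grouped cummax does not return a boolean dtype on a given pandas version.

```python
import pandas as pd

a = pd.DataFrame(
    [[1, 'view', 'a'], [1, 'cart', 'b'], [2, 'cart', 'b'], [2, 'cart', 'c'],
     [2, 'view', 'd'], [2, 'purchase', 'd'], [2, 'view', 'e'], [2, 'cart', 'e']],
    columns=['user_session', 'event_type', 'product_id'])

# Step 1: True only on the purchase rows.
is_purchase = a['event_type'].eq('purchase')

# Step 2: a running maximum per session. Once a True has been seen,
# every later row of that session is True as well.
# astype(bool) is a defensive assumption, not part of the original answer.
at_or_after = is_purchase.groupby(a['user_session']).cummax().astype(bool)

# Step 3: keep the rows that come strictly before the first purchase.
before_purchase = a[~at_or_after]
print(before_purchase)
```

Sessions without any purchase are kept whole, because their flag never becomes True.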

Or, if you also need to keep the purchase row, use two groupbys, the second one to shift the mask down by one row within each group:

mask = ~(a['event_type'].eq('purchase')
          .groupby(a['user_session']).cummax()
          .groupby(a['user_session']).shift()
          .fillna(False))

a[mask]
#   user_session event_type product_id
#0             1       view          a
#1             1       cart          b
#2             2       cart          b
#3             2       cart          c
#4             2       view          d
#5             2   purchase          d
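
The shifted version can be sketched step by step in the same way. Two things here are assumptions rather than part of the answer: the variable names, and the use of fill_value=False in the grouped shift (instead of filling the gap afterwards), which presumes a pandas version whose groupby shift accepts that argument. The trailing astype(bool) is likewise only a safeguard.

```python
import pandas as pd

a = pd.DataFrame(
    [[1, 'view', 'a'], [1, 'cart', 'b'], [2, 'cart', 'b'], [2, 'cart', 'c'],
     [2, 'view', 'd'], [2, 'purchase', 'd'], [2, 'view', 'e'], [2, 'cart', 'e']],
    columns=['user_session', 'event_type', 'product_id'])

sessions = a['user_session']

# True from the first purchase of each session onward.
seen = a['event_type'].eq('purchase').groupby(sessions).cummax().astype(bool)

# Shift the flag down one row inside each session, so the first purchase
# row itself is still False and only the rows after it are True.
# fill_value=False fills the first row of every session (assumed supported).
after_purchase = seen.groupby(sessions).shift(fill_value=False).astype(bool)

# Keep everything up to and including the first purchase.
up_to_purchase = a[~after_purchase]
print(up_to_purchase)
```

Shifting per group matters: a plain Series shift would carry the flag of one session's last row into the first row of the next session.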

Answer 2

Try:

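# group_keys=False keeps the original row index on the result, so the
# comparison below still aligns with a on recent pandas versions
to_remove = (a['event_type'].eq('purchase')
                .groupby(a['user_session'], group_keys=False)
                .apply(lambda x: x.shift(fill_value=0).cumsum())
            )
a[to_remove == 0]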

Output:

   user_session event_type product_id
0             1       view          a
1             1       cart          b
2             2       cart          b
3             2       cart          c
4             2       view          d
5             2   purchase          d

If you don't want to keep the first purchase event, replace apply(lambda ...) with .cumsum()
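
A minimal sketch of that .cumsum() variant, for illustration only. The names purchases_so_far and without_purchase are made up here, and the astype(int) cast is an added assumption so that the running count is an integer regardless of how the installed pandas sums booleans per group.

```python
import pandas as pd

a = pd.DataFrame(
    [[1, 'view', 'a'], [1, 'cart', 'b'], [2, 'cart', 'b'], [2, 'cart', 'c'],
     [2, 'view', 'd'], [2, 'purchase', 'd'], [2, 'view', 'e'], [2, 'cart', 'e']],
    columns=['user_session', 'event_type', 'product_id'])

# Running count of purchases per session, counting the current row.
purchases_so_far = (a['event_type'].eq('purchase')
                      .astype(int)  # assumption: cast so cumsum yields integers
                      .groupby(a['user_session'])
                      .cumsum())

# A count of 0 means no purchase has happened yet in this session,
# so the purchase row and everything after it are dropped.
without_purchase = a[purchases_so_far == 0]
print(without_purchase)
```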
