Maximum of a DataFrame column based on conditions relative to the same DataFrame

Question

I have a DataFrame that looks like this:

CUSTOMER_ID MONTH       ACTIVE
123456      2020-01     1
123456      2020-02     0
123456      2020-03     0
123456      2020-04     1
654321      2020-01     1
654321      2020-02     1
654321      2020-03     0
654321      2020-04     0

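Each row represents a specific customer's state in a given month. To each row I need to add the MONTH in which that customer was last active, relative to that row's MONTH.

For the sample subset shown here, the DataFrame should look like this: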

CUSTOMER_ID MONTH       ACTIVE      LAST_TIME_ACTIVE
123456      2020-01     1               2020-01
123456      2020-02     0               2020-01
123456      2020-03     0               2020-01
123456      2020-04     1               2020-04
654321      2020-01     1               2020-01
654321      2020-02     1               2020-02
654321      2020-03     0               2020-02
654321      2020-04     0               2020-02

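I tried a previously explained solution, but it gives me the overall maximum, which does not satisfy the "relative to that row's MONTH" condition.

On top of that, I tried defining a function and calling it on my DataFrame with .apply(), but it is very slow because it filters the whole DataFrame for every row, which is the most expensive operation of all.

The function is defined as follows: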

def get_last_active_month(dfRow, wholeDF):
    try:
        lastActiveMonth = wholeDF[(wholeDF['CUSTOMER_ID']==dfRow['CUSTOMER_ID']) & (wholeDF['MONTH']<=dfRow['MONTH']) & (wholeDF['ACTIVE']==1)]['MONTH'].item()
    except:
        lastActiveMonth = '2017-12'
    finally:
        return lastActiveMonth

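I am dealing with more than 90,000 customers, and I need to apply this logic to data from 2018 up to today, so we are talking about a lot of rows. Looping is of course out of the question (I even tried it out of desperation; it was very slow and a non-Pythonic "solution").

I would kindly ask for your help in finding a fast, Pythonic solution. Thank you!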

Answer 1

Use pd.Series.where together with groupby and ffill:

df["new"] = df["MONTH"].where(df["ACTIVE"].ne(0))

df["new"] = df.groupby("CUSTOMER_ID")["new"].ffill()

print (df)

   CUSTOMER_ID    MONTH  ACTIVE      new
0       123456  2020-01       1  2020-01
1       123456  2020-02       0  2020-01
2       123456  2020-03       0  2020-01
3       123456  2020-04       1  2020-04
4       654321  2020-01       1  2020-01
5       654321  2020-02       1  2020-02
6       654321  2020-03       0  2020-02
7       654321  2020-04       0  2020-02
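
For reference, here is a self-contained sketch of the same where + ffill idea, assuming MONTH is stored as "YYYY-MM" strings (which sort chronologically). The explicit sort, the fillna fallback (taken from the '2017-12' default in the question's function), and customer 999999 are additions for illustration and are not part of the original answer or data.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical customer (999999)
# who is inactive in their first month, to show the fallback.
df = pd.DataFrame({
    "CUSTOMER_ID": [123456] * 4 + [654321] * 4 + [999999] * 2,
    "MONTH": ["2020-01", "2020-02", "2020-03", "2020-04"] * 2
             + ["2020-01", "2020-02"],
    "ACTIVE": [1, 0, 0, 1, 1, 1, 0, 0, 0, 1],
})

# Forward fill depends on row order, so sort each customer's months first.
df = df.sort_values(["CUSTOMER_ID", "MONTH"]).reset_index(drop=True)

# Keep MONTH only on active rows; inactive rows become missing values.
active_month = df["MONTH"].where(df["ACTIVE"].eq(1))

# Carry the last active month forward, separately for each customer.
df["LAST_TIME_ACTIVE"] = active_month.groupby(df["CUSTOMER_ID"]).ffill()

# Rows before a customer's first active month are still missing here.
# Assumption: reuse the '2017-12' default from the question's function.
df["LAST_TIME_ACTIVE"] = df["LAST_TIME_ACTIVE"].fillna("2017-12")

print(df)
```

Grouping before the forward fill is what keeps one customer's last active month from leaking into the next customer's rows.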

Answer 2

A pandas (obfuscated) one-liner (assuming MONTH is a date type):

df['month_last_active'] = df.groupby([df.CUSTOMER_ID, df.groupby('CUSTOMER_ID').ACTIVE.cumsum()]).MONTH.cummin()
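
The one-liner may be easier to follow when unpacked. The sketch below is one reading of it, not the answerer's own code: the running sum of ACTIVE per customer increases at every active row, so each active month and the inactive months after it share one block number. It assumes rows are sorted by month within each customer, uses transform("min") where the answer uses cummin (on sorted rows both return the block's first month), and masks block 0, which holds rows before any activity. The names block and first_in_block and customer 999999 are hypothetical.

```python
import pandas as pd

# Sample data from the question, with MONTH parsed as dates, plus a
# hypothetical customer (999999) who is inactive in their first month.
df = pd.DataFrame({
    "CUSTOMER_ID": [123456] * 4 + [654321] * 4 + [999999] * 2,
    "MONTH": pd.to_datetime(
        ["2020-01", "2020-02", "2020-03", "2020-04"] * 2
        + ["2020-01", "2020-02"]
    ),
    "ACTIVE": [1, 0, 0, 1, 1, 1, 0, 0, 0, 1],
})
df = df.sort_values(["CUSTOMER_ID", "MONTH"]).reset_index(drop=True)

# Running count of active months per customer. It steps up on each
# active row, so an active month and the inactive rows that follow it
# end up with the same block number.
block = df.groupby("CUSTOMER_ID")["ACTIVE"].cumsum().rename("block")

# Earliest MONTH in each (customer, block) pair, i.e. the active month
# that opened the block.
first_in_block = df.groupby([df["CUSTOMER_ID"], block])["MONTH"].transform("min")

# Block 0 means the customer has not been active yet: leave it as NaT.
df["month_last_active"] = first_in_block.where(block.gt(0))

print(df)
```

Without the final mask, rows before a customer's first active month would report their own earliest month as the last active one.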
