Cumulative Count with Specific Condition in PANDAS

Question

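I have a dataset of organizations, with multiple years for each organization. I have looked through a lot of cumulative sum and groupby answers on Stack Overflow, but I can't seem to find one that fits my situation.

I am interested in counting the cumulative number of years that each organization has had a new, active program. The program is indicated by a value of "1" in the Program column. What I am trying to get is the new column Years_NEW_Program shown below.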

OrgID   Year    Program     Years_NEW_Program       event_window
3128    2015    0           0                       -2
3128    2016    0           0                       -1
3128    2017    1           1                       0
3128    2018    1           2                       1
11502   2015    1           0                       
11502   2016    1           0
31530   2009    0           0                       -2  
31530   2010    0           0                       -1
31530   2011    1           1                       0   
31530   2012    1           2                       1   
31530   2013    1           3                       2   
31530   2014    0           0
99      2014    1           0     
99      2015    0           0
99      2016    1           0   
99      2017    0           0
99      2018    0           0

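What makes this unique is that I only want the count to start when an organization had no program in the preceding years (indicated by "0" under Program) and then implemented it (indicated by "1" under Program). I also want the count to start only if the organization had at least two years of "0" before starting the program and then kept the program for at least two years, which is why ID 99 above does not receive a count.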

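Ideally, I would also get the event_window column for the organizations that receive non-zero values in Years_NEW_Program. But if necessary, I can work with just Years_NEW_Program.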

Thanks for your help!

Answer 1

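Here is an (admittedly verbose) way to do this. First, create a separate dataframe for each OrgID, which makes them easier to work with; you concatenate them back together later. For each of these dataframes, create "startCounter" and "stopCounter" according to your conditions. Then add a column "counting", which marks the rows where the counter should be running. Combined with a function that computes the cumulative sum with a reset, that should do it.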

import pandas as pd
import numpy as np

df = pd.read_csv('file.csv')

def cumsumWithReset(df):
    # Make the cumulative sum of the column "counting"
    # When the value of "counting" is zero, then reset the cumulative sum
    prevVal = 0
    df["cumsum"] = 0
    for index, row in df.iterrows():
        cumsum = row["counting"] + prevVal
        if row['counting'] == 0:
            cumsum = 0
        prevVal = cumsum
        df.loc[index, 'cumsum'] = cumsum
    return df


df = df.sort_values(by="OrgID", ascending = True)
orgList = df.OrgID.drop_duplicates()
dfList = []
for org in orgList:
    dfOrg = df[df["OrgID"] == org]
    dfOrg = dfOrg.sort_values(by="Year", ascending = True).reset_index(drop=True)
    dfOrg['program1Ybefore'] = dfOrg["Program"].shift(periods=1, fill_value = 1)
    dfOrg['program2Ybefore'] = dfOrg["Program"].shift(2, fill_value = 1)
    dfOrg['startCounter'] = (dfOrg['program1Ybefore'] == 0) & (dfOrg['program2Ybefore'] == 0) & (dfOrg['Program'] == 1)
    dfOrg['stopCounter'] =  dfOrg["Program"] == 0
    dfOrg['counting'] =  np.where(dfOrg['startCounter'] & ~dfOrg['stopCounter'],1,np.nan)
    dfOrg['counting'] =  np.where(dfOrg['stopCounter'],0,dfOrg['counting'])
    dfOrg['counting'] =  dfOrg['counting'].ffill(axis = 0).fillna(0) 
    dfOrg = cumsumWithReset(dfOrg)
    dfList.append(dfOrg)

dfResult = pd.concat(dfList).reset_index(drop=True)
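To make the intermediate columns concrete, here is a minimal sketch that traces the same steps for a single organization (OrgID 31530 from the question). The data is typed in by hand rather than read from file.csv, and `mask` is used in place of the second `np.where`, so treat it as an illustration of the idea rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Program values for OrgID 31530 (2009-2014), typed in from the question's table
dfOrg = pd.DataFrame({"Year": list(range(2009, 2015)),
                      "Program": [0, 0, 1, 1, 1, 0]})

# Missing history is filled with 1, so the first two years can never start a count
dfOrg["program1Ybefore"] = dfOrg["Program"].shift(1, fill_value=1)
dfOrg["program2Ybefore"] = dfOrg["Program"].shift(2, fill_value=1)

dfOrg["startCounter"] = ((dfOrg["program1Ybefore"] == 0)
                         & (dfOrg["program2Ybefore"] == 0)
                         & (dfOrg["Program"] == 1))
dfOrg["stopCounter"] = dfOrg["Program"] == 0

# 1 on the start row, 0 on stop rows, NaN elsewhere; ffill extends the last flag
dfOrg["counting"] = np.where(dfOrg["startCounter"] & ~dfOrg["stopCounter"], 1.0, np.nan)
dfOrg["counting"] = dfOrg["counting"].mask(dfOrg["stopCounter"], 0.0)
dfOrg["counting"] = dfOrg["counting"].ffill().fillna(0)

print(dfOrg[["Year", "Program", "startCounter", "stopCounter", "counting"]])
```

The "counting" column ends up as 1 for 2011 through 2013 and 0 elsewhere, which is the input that cumsumWithReset then turns into 1, 2, 3.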

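Edit for a large df: instead of looping over a separate dataframe for each organization, create a flag that tracks where the organization changes.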

df = df.sort_values(by=["OrgID", "Year"], ascending = [True, True])
df["newOrg"] = df["OrgID"] != df["OrgID"].shift(1)
df["newOrgShift"] = df["newOrg"].shift(1, fill_value = True)

df['program1Ybefore'] = df["Program"].shift(periods=1, fill_value = 1)
df['program1Ybefore'] = np.where(df["newOrg"],1,df['program1Ybefore'])
df['program2Ybefore'] = df["Program"].shift(2, fill_value = 1)
df['program2Ybefore'] = np.where((df["newOrg"]) | (df["newOrgShift"]) ,1,df['program2Ybefore'])


df['startCounter'] = (df['program1Ybefore'] == 0) & (df['program2Ybefore'] == 0) & (df['Program'] == 1)
df['stopCounter'] =  (df["Program"] == 0) | (df["newOrg"])
df['counting'] =  np.where(df['startCounter'] & ~df['stopCounter'],1,np.nan)
df['counting'] =  np.where(df['stopCounter'],0,df['counting'])
df['counting'] =  df['counting'].ffill(axis = 0).fillna(0) 

df = cumsumWithReset(df)
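As a sanity check, the flag-based version can be run against the sample from the question. The sketch below is self-contained: the rows are typed in from the question's table, the variable names are my own, `mask` replaces the `np.where` overwrites, and a plain loop stands in for cumsumWithReset. It only shows that the logic reproduces Years_NEW_Program on this sample, nothing more.

```python
import numpy as np
import pandas as pd

# (OrgID, Year, Program, expected Years_NEW_Program), copied from the question
rows = [
    (3128, 2015, 0, 0), (3128, 2016, 0, 0), (3128, 2017, 1, 1), (3128, 2018, 1, 2),
    (11502, 2015, 1, 0), (11502, 2016, 1, 0),
    (31530, 2009, 0, 0), (31530, 2010, 0, 0), (31530, 2011, 1, 1),
    (31530, 2012, 1, 2), (31530, 2013, 1, 3), (31530, 2014, 0, 0),
    (99, 2014, 1, 0), (99, 2015, 0, 0), (99, 2016, 1, 0),
    (99, 2017, 0, 0), (99, 2018, 0, 0),
]
df = pd.DataFrame(rows, columns=["OrgID", "Year", "Program", "expected"])
df = df.sort_values(by=["OrgID", "Year"]).reset_index(drop=True)

# True on the first row of each organization, and on the row right after it
new_org = df["OrgID"] != df["OrgID"].shift(1)
new_org_shift = new_org.shift(1, fill_value=True)

# Look back one and two rows, but never across an organization boundary
one_before = df["Program"].shift(1, fill_value=1).mask(new_org, 1)
two_before = df["Program"].shift(2, fill_value=1).mask(new_org | new_org_shift, 1)

start = (one_before == 0) & (two_before == 0) & (df["Program"] == 1)
stop = (df["Program"] == 0) | new_org

counting = pd.Series(np.where(start & ~stop, 1.0, np.nan), index=df.index)
counting = counting.mask(stop, 0.0).ffill().fillna(0)

# Running total that resets on every 0 (simplified stand-in for cumsumWithReset)
running, total = 0, []
for value in counting:
    running = running + 1 if value == 1 else 0
    total.append(running)
df["cumsum"] = total

print(df[["OrgID", "Year", "Program", "expected", "cumsum"]])
```

Including newOrg in the stop condition is what keeps a count that is still running at the end of one organization from leaking into the first rows of the next.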

Answer 2

@braml1's answer works, but here I offer an alternative that tweaks a few things. First, here is the alternative solution:

df = df.sort_values(by=['OrgID', 'Year'], ascending = True)
df['startCounter'] = df.groupby('OrgID')['Program'].apply(lambda x: 
                          ((x.shift(1)==0)&(x.shift(2) == 0) & (x == 1))).values
df['stopCounter'] = df.groupby('OrgID')['Program'].apply(lambda x: x==0).values
df['counting'] = np.where(df['startCounter'] & ~df['stopCounter'],1,np.nan)
df['counting'] = np.where(df['stopCounter'], 0, df['counting'])
df['counting'] = df.groupby('OrgID')['counting'].ffill().fillna(0) 
a = df.groupby('OrgID')['counting'].fillna(0).eq(1)
b = a.cumsum()
df['cumsum'] = b-b.where(~a).ffill().fillna(0).astype(int)

Here are the main differences. First, I sort by OrgID and Year:

df = df.sort_values(by=['OrgID', 'Year'], ascending = True)

Then my startCounter and stopCounter differ in that they incorporate groupby statements:

df['startCounter'] = df.groupby('OrgID')['Program'].apply(lambda x: 
                      ((x.shift(1)==0)&(x.shift(2) == 0) & (x == 1))).values
df['stopCounter'] = df.groupby('OrgID')['Program'].apply(lambda x: x==0).values

With these commands, I can skip creating the two intermediate variables program1Ybefore and program2Ybefore.
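The reason the intermediate columns can be dropped is that a shift inside a group never reaches into the previous organization: the first rows of each group get NaN, and NaN == 0 is False. Here is a minimal sketch on made-up data that compares a grouped shift with a plain one. It calls groupby(...).shift directly rather than apply with a lambda, which is a swapped-in formulation that I believe is equivalent here.

```python
import pandas as pd

# Toy data (hypothetical): org 1 ends on two 0s, org 2 starts with a 1
df = pd.DataFrame({
    "OrgID":   [1, 1, 1, 2, 2, 2, 2],
    "Year":    [2015, 2016, 2017, 2015, 2016, 2017, 2018],
    "Program": [1, 0, 0, 1, 0, 0, 1],
})
df = df.sort_values(by=["OrgID", "Year"])

# Grouped shift: look back only within the same organization
prog = df.groupby("OrgID")["Program"]
start_grouped = (prog.shift(1) == 0) & (prog.shift(2) == 0) & (df["Program"] == 1)

# Plain shift: looks back across the organization boundary
plain = df["Program"]
start_plain = (plain.shift(1) == 0) & (plain.shift(2) == 0) & (plain == 1)

print(pd.DataFrame({"OrgID": df["OrgID"], "Program": df["Program"],
                    "start_grouped": start_grouped, "start_plain": start_plain}))
```

The plain shift flags the first row of org 2 as a start because it sees the two trailing zeros of org 1; the grouped shift does not.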

Next, the first two lines that create the counting variable are the same as in @braml1's answer:

df['counting'] = np.where(df['startCounter'] & ~df['stopCounter'],1,np.nan)
df['counting'] = np.where(df['stopCounter'], 0, df['counting'])

The third line, however, again includes a groupby:

df['counting'] = df.groupby('OrgID')['counting'].ffill().fillna(0)

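The biggest change, however, comes in the last step, creating the cumsum variable. Here I was inspired by a different answer.

Specifically, instead of applying @braml1's cumsumWithReset function (which loops over every row of the dataframe), I apply a cumulative sum and reset it whenever a specific condition is met. First, a converts the binary (0/1) column counting into a True/False column. Recall that counting is the column that flags every row with a valid 'new program'; those are the rows for which we want a cumulative sum.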

a = df.groupby('OrgID')['counting'].fillna(0).eq(1)

b then takes the cumulative sum of the values in a:

b = a.cumsum()

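Finally, we assign the new variable cumsum. Where a is False, b.where(~a) keeps the running total b, and ffill carries that value forward through the following run of True rows. Subtracting it from b leaves a count that restarts at 1 in each run and is zero wherever a is False: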

df['cumsum'] = b-b.where(~a).ffill().fillna(0).astype(int)

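It is this last step that really helps performance. By avoiding the iterrows loop in the cumsumWithReset function, we can speed things up considerably, especially for large datasets.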
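To see how a, b, and the final subtraction fit together, here is a small worked sketch on a hand-made counting column. The values and the helper name `offset` are only for illustration; `offset` is the part of the expression that gets subtracted from b.

```python
import pandas as pd

# Hand-made "counting" column: two runs of 1s separated by 0s
counting = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])

a = counting.eq(1)   # True on rows that should be counted
b = a.cumsum()       # running total over the whole column, never resets

# Running total reached just before the current run of True rows
offset = b.where(~a).ffill().fillna(0).astype(int)

result = b - offset  # count that restarts at 1 in every run

print(pd.DataFrame({"counting": counting, "a": a, "b": b,
                    "offset": offset, "cumsum": result}))
```

Because everything is expressed as whole-column operations, no Python-level loop over rows is needed.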

Thanks again to @braml1 for the help. Your solution works! My alternative is just a few incremental improvements.
