Get cumulative sum in pandas

Context

Datetime Campaign_name Status Open_time
2022-03-15 00:00 Funny_campaign Open
2022-03-15 01:00 Funny_campaign Continue
2022-03-15 02:00 Funny_campaign Continue
2022-03-15 03:00 Funny_campaign Continue
2022-03-15 04:00 Funny_campaign Close
2022-03-15 08:00 Funny_campaign Open
2022-03-15 09:00 Funny_campaign Continue
2022-03-15 10:00 Funny_campaign Close
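To reproduce the examples, the table above can be loaded into a DataFrame. This is a minimal sketch, assuming the column names shown and leaving out the empty Open_time column; the variable name `data` and the CSV layout are my own choices, not part of the question.

```python
import io

import pandas as pd

# Sample rows from the table above, rewritten as CSV (assumed layout)
data = """Datetime,Campaign_name,Status
2022-03-15 00:00,Funny_campaign,Open
2022-03-15 01:00,Funny_campaign,Continue
2022-03-15 02:00,Funny_campaign,Continue
2022-03-15 03:00,Funny_campaign,Continue
2022-03-15 04:00,Funny_campaign,Close
2022-03-15 08:00,Funny_campaign,Open
2022-03-15 09:00,Funny_campaign,Continue
2022-03-15 10:00,Funny_campaign,Close
"""

# Datetime is read as plain strings here; it is converted later with pd.to_datetime
df = pd.read_csv(io.StringIO(data))
print(df)
```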

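Question

I need to compute the time from open to close.

My current code

There are two approaches I could take: get the open time on each 'Close' row, or get the cumulative open_time on each 'Open' and 'Continue' row. Here is my attempt at the latter.

My current code is almost right: it does not count the time between a close and the next open, but it fails to add the last time difference.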

import pandas as pd

df["Datetime"] = pd.to_datetime(df["Datetime"])
# Minutes elapsed since the previous row (0 for the first row)
df["time_diff"] = df["Datetime"].diff().dt.total_seconds().div(60).fillna(0)
# Zeroing the 'Close' rows is what discards the last interval of each open period
condition = df["Status"] == "Close"
df.loc[condition, "time_diff"] = 0
df["Cumulative time"] = df.groupby(["Campaign_name"])["time_diff"].cumsum()
df = df.drop(columns="time_diff")

IIUC, you can start a new group at each 'Open' row and use:

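df['Datetime'] = pd.to_datetime(df['Datetime'])

# Each 'Open' row starts a new group: 1, 1, 1, 1, 1, 2, 2, 2
group = df['Status'].eq('Open').cumsum()

# Elapsed time since the first row (the 'Open') of each group.
# group_keys=False keeps the original index so the result aligns with df.
df['Open_time'] = df.groupby(group, group_keys=False)['Datetime'].apply(lambda g: g - g.iloc[0])
# or, alternative syntax (the first row of each group is NaT instead of 0)
# df['Open_time'] = df.groupby(group, group_keys=False)['Datetime'].apply(lambda g: g.diff().cumsum())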

Output:

             Datetime   Campaign_name    Status       Open_time
0 2022-03-15 00:00:00  Funny_campaign      Open 0 days 00:00:00
1 2022-03-15 01:00:00  Funny_campaign  Continue 0 days 01:00:00
2 2022-03-15 02:00:00  Funny_campaign  Continue 0 days 02:00:00
3 2022-03-15 03:00:00  Funny_campaign  Continue 0 days 03:00:00
4 2022-03-15 04:00:00  Funny_campaign     Close 0 days 04:00:00
5 2022-03-15 08:00:00  Funny_campaign      Open 0 days 00:00:00
6 2022-03-15 09:00:00  Funny_campaign  Continue 0 days 01:00:00
7 2022-03-15 10:00:00  Funny_campaign     Close 0 days 02:00:00
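To see why the grouping works, it can help to print the key on its own: `eq('Open')` is True only on rows that start a period, so its cumulative sum increases by one at each 'Open'. The sketch below uses a shortened copy of the sample data and computes the same elapsed time with `transform('first')`, which is my substitution for the `apply` call, not the answer's code.

```python
import pandas as pd

# Shortened copy of the sample data (two open periods)
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2022-03-15 00:00", "2022-03-15 01:00", "2022-03-15 04:00",
        "2022-03-15 08:00", "2022-03-15 10:00",
    ]),
    "Status": ["Open", "Continue", "Close", "Open", "Close"],
})

# True on each 'Open' row; the running total labels every row with its period number
group = df["Status"].eq("Open").cumsum()
print(group.tolist())  # [1, 1, 1, 2, 2]

# Subtract each period's first timestamp (its 'Open') from every row of that period
open_time = df["Datetime"] - df.groupby(group)["Datetime"].transform("first")
print(open_time)
```

Because `transform` returns a result aligned with the original index, the subtraction needs no extra index handling.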

Or assign only to the 'Close' rows:

df.loc[df['Status'].eq('Close'), 'Open_time'] = df.groupby(group, group_keys=False)['Datetime'].apply(lambda g: g - g.iloc[0])

Output:

             Datetime   Campaign_name    Status        Open_time
0 2022-03-15 00:00:00  Funny_campaign      Open              NaN
1 2022-03-15 01:00:00  Funny_campaign  Continue              NaN
2 2022-03-15 02:00:00  Funny_campaign  Continue              NaN
3 2022-03-15 03:00:00  Funny_campaign  Continue              NaN
4 2022-03-15 04:00:00  Funny_campaign     Close  0 days 04:00:00
5 2022-03-15 08:00:00  Funny_campaign      Open              NaN
6 2022-03-15 09:00:00  Funny_campaign  Continue              NaN
7 2022-03-15 10:00:00  Funny_campaign     Close  0 days 02:00:00

Close minus open difference per group:

df.groupby(group)['Datetime'].agg(lambda g: g.iloc[-1]-g.iloc[0])

Output:

Status
1   0 days 04:00:00
2   0 days 02:00:00
Name: Datetime, dtype: timedelta64[ns]
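Since the question's code works in minutes, the per-group durations can be turned into plain numbers by dividing by a one-minute Timedelta. A minimal sketch on a shortened copy of the sample data; it uses `last()` minus `first()` in place of the lambda, and the names `duration` and `minutes` are illustrative.

```python
import pandas as pd

# Shortened copy of the sample data (two open periods)
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2022-03-15 00:00", "2022-03-15 01:00", "2022-03-15 04:00",
        "2022-03-15 08:00", "2022-03-15 10:00",
    ]),
    "Status": ["Open", "Continue", "Close", "Open", "Close"],
})

group = df["Status"].eq("Open").cumsum()
grouped = df.groupby(group)["Datetime"]

# Duration of each period: last timestamp (the 'Close') minus first (the 'Open')
duration = grouped.last() - grouped.first()

# Dividing a timedelta Series by a one-minute Timedelta yields floats
minutes = duration / pd.Timedelta(minutes=1)
print(minutes.tolist())  # [240.0, 120.0]
```

This assumes every period ends with a 'Close' row; a period that is still open would report the time up to its latest row instead.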