Work out the difference between two values in a column while staying within the bounds of another column?

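I have a dataframe and I'm trying to calculate the time difference between two different topics while staying within a single call, without spilling over into a new call (i.e. making sure it never computes the time difference between topics that belong to different calls). Each interaction_id is a separate call.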

Here is a sample dataframe

import numpy as np
import pandas as pd

NaN = np.nan  # alias so the bare NaN literals in the rows below resolve

df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'], [3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])

      interaction_id    start_time     topic 
           1               2           Cost
           1              5.72          NaN
           1              8.83         Billing
           1              12.86         NaN
           2               2            Cost
           2              6.75          NaN
           2              8.54          NaN
           3              1.5          Payments
           3              3.65         Products

This is the desired output

df2 = pd.DataFrame([[1, 2, 'Cost',6.83], [1, 5.72, NaN, NaN], [1, 8.83, 'Billing',4.03], [1, 12.86, NaN,NaN], [2, 2, 'Cost',6.54], [2, 6.75, NaN, NaN], [2, 8.54, NaN, NaN], [3, 1.5, 'Payments', 2.15],[3, 3.65, 'Products','...']], columns=['interaction_id', 'start_time', 'topic','topic_length'])

       interaction_id    start_time     topic     topic_length
           1               2           Cost           6.83
           1              5.72          NaN           NaN
           1              8.83         Billing        4.03
           1              12.86         NaN           NaN
           2               2            Cost          6.54
           2              6.75          NaN           NaN
           2              8.54          NaN           NaN
           3              1.5          Payments       2.15
           3              3.65         Products       ....

I don't know whether there is a simpler solution, but this approach solves your problem:

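def custom_agg(group):
    # Use a 0..n-1 index so row positions within one call can be compared directly.
    group = group.reset_index(drop=True)
    max_ind = group.shape[0] - 1
    current_ind = -1     # row index of the topic currently being timed
    current_val = None   # start_time of that topic
    for ind, val in group.iterrows():
        # Skip rows without a topic, except the last row, which closes the final topic.
        if pd.isna(val.topic) and ind != max_ind:
            continue
        if current_ind == -1:
            # First row that counts in this call: start timing from here.
            current_ind = ind
            current_val = val["start_time"]
        else:
            # A new topic (or the end of the call) closes the previous topic.
            group.loc[current_ind, "topic_length"] = val["start_time"] - current_val
            current_ind = ind
            current_val = val["start_time"]
    return group

# Sort so rows are in time order within each call, then process each call on its own.
# Recent pandas versions may emit a deprecation warning here because apply
# operates on the grouping column.
df = (
    df.sort_values(by=['interaction_id', 'start_time'])
      .groupby("interaction_id")
      .apply(custom_agg)
      .reset_index(drop=True)
)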

Output:

   interaction_id  start_time     topic  topic_length
0               1        2.00      Cost          6.83
1               1        5.72       NaN           NaN
2               1        8.83   Billing          4.03
3               1       12.86       NaN           NaN
4               2        2.00      Cost          6.54
5               2        6.75       NaN           NaN
6               2        8.54       NaN           NaN
7               3        1.50  Payments          2.15
8               3        3.65  Products           NaN
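
If a loop-free version is preferred, the same rule can probably be expressed with grouped shift and backfill. This is only a sketch under my reading of the desired output, not the function above: a topic runs from its own start_time to the start_time of the next labelled row in the same call, or to the last row of the call if no other topic follows. The helper names (is_last, is_boundary, boundary_time, next_boundary) are my own and purely illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [[1, 2, 'Cost'], [1, 5.72, np.nan], [1, 8.83, 'Billing'], [1, 12.86, np.nan],
     [2, 2, 'Cost'], [2, 6.75, np.nan], [2, 8.54, np.nan],
     [3, 1.5, 'Payments'], [3, 3.65, 'Products']],
    columns=['interaction_id', 'start_time', 'topic'],
)

# Put rows in time order within each call.
df = df.sort_values(['interaction_id', 'start_time']).reset_index(drop=True)
calls = df['interaction_id']

# Boundary rows (assumed rule): every labelled row, plus the last row of each call.
is_last = ~calls.duplicated(keep='last')
is_boundary = df['topic'].notna() | is_last

# Keep start_time on boundary rows only; everything else becomes NaN.
boundary_time = df['start_time'].where(is_boundary)

# For each row, find the start_time of the next boundary after it in the same call:
# shift up by one within the call, then backfill within the call.
shifted = boundary_time.groupby(calls).shift(-1)
next_boundary = shifted.groupby(calls).bfill()

# Only labelled rows get a length; the last row of a call has no next boundary.
df['topic_length'] = (next_boundary - df['start_time']).where(df['topic'].notna())

print(df)
```

On the sample frame this gives the same topic_length values as the loop version (6.83, 4.03, 6.54 and 2.15, up to floating-point rounding), with NaN everywhere else. Because both groupby calls are keyed on interaction_id, nothing is ever carried across calls.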