How can I work out the difference between two values in a column while remaining in the bounds of another column?

Question

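I have a DataFrame and I am trying to calculate the time difference between two different topics while staying within a single call rather than spilling over into a new one (i.e. making sure it does not compute the time difference between topics in different calls). Each interaction_id is a separate call.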

Here is a sample DataFrame:

df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])

      interaction_id    start_time     topic 
           1               2           Cost
           1              5.72          NaN
           1              8.83         Billing
           1              12.86         NaN
           2               2            Cost
           2              6.75          NaN
           2              8.54          NaN
           3              1.5          Payments
           3              3.65         Products

Here is the desired output:

df2 = pd.DataFrame([[1, 2, 'Cost',6.83], [1, 5.72, NaN, NaN], [1, 8.83, 'Billing',4.03], [1, 12.86, NaN,NaN], [2, 2, 'Cost',6.54], [2, 6.75, NaN, NaN], [2, 8.54, NaN, NaN], [3, 1.5, 'Payments', 2.15],[3, 3.65, 'Products','...']], columns=['interaction_id', 'start_time', 'topic','topic_length'])

       interaction_id    start_time     topic     topic_length

           1               2           Cost           6.83
           1              5.72          NaN           NaN
           1              8.83         Billing        4.03
           1              12.86         NaN           NaN
           2               2            Cost          6.54
           2              6.75          NaN           NaN
           2              8.54          NaN           NaN
           3              1.5          Payments       2.15
           3              3.65         Products       ....

I hope this makes sense.

Answer 1

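Could you try the following approach?

I apply a function to each call (interaction) and assign every topic within that call a unique number (ngroup). I then give the end of the call its own number (-1) and use diff to calculate the topic lengths.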

import pandas as pd
import numpy as np
from numpy import nan as NaN
df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])
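def func(df):
    # Number the distinct topics of this call in reverse order, so the last topic gets 0.
    ngroup_df = pd.DataFrame({"topic":df.ffill()['topic'].drop_duplicates().to_list(),"ngroup":[i for i in range(len(df.ffill()['topic'].drop_duplicates().to_list()))][::-1]})
    # Forward-fill the topic onto the rows that have none and attach each topic's ngroup.
    df = df.ffill().merge(ngroup_df)
    # Temporarily give the last row of the call its own group (-1) to mark the end of the call.
    df.loc[df.index.max(), 'ngroup'] = -1
    # Earliest start_time per group; the diff between consecutive groups is each topic's length.
    length_df = df[['start_time','ngroup']].groupby('ngroup').min().diff().dropna().rename({'start_time':'length'}, axis = 1).reset_index()
    length_df['length'] = length_df['length'].abs()
    # Put the last row back into the final topic's group (0) and attach the lengths.
    df.loc[df.index.max(), 'ngroup'] = 0
    return df.merge(length_df, how = 'left')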
>>> print(df.groupby(['interaction_id']).apply(func).reset_index(drop = True))
   interaction_id  start_time     topic  ngroup  length
0               1        2.00      Cost       1    6.83
1               1        5.72      Cost       1    6.83
2               1        8.83   Billing       0    4.03
3               1       12.86   Billing       0    4.03
4               2        2.00      Cost       0    6.54
5               2        6.75      Cost       0    6.54
6               2        8.54      Cost       0    6.54
7               3        1.50  Payments       1    2.15
8               3        3.65  Products       0     NaN
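
Note that the output above forward-fills the topic and repeats the length on every row, whereas the question asks for a topic_length only on the rows where a topic starts. As a possible alternative, here is a minimal vectorized sketch, not the approach used above. It assumes a topic lasts until the next topic in the same call starts, or until the last start_time of the call if there is no later topic. The question leaves the value for a topic that starts on the final row of a call unspecified ('...'), so leaving it as NaN is an assumption. The names starts, next_start, call_end and length are purely illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 'Cost'], [1, 5.72, np.nan], [1, 8.83, 'Billing'], [1, 12.86, np.nan],
     [2, 2, 'Cost'], [2, 6.75, np.nan], [2, 8.54, np.nan],
     [3, 1.5, 'Payments'], [3, 3.65, 'Products']],
    columns=['interaction_id', 'start_time', 'topic'])

# Keep only the rows where a topic starts (original index is preserved).
starts = df[df['topic'].notna()]

# Start time of the next topic within the same call (NaN for the last topic of a call).
next_start = starts.groupby('interaction_id')['start_time'].shift(-1)

# Last start_time of each call, used as the end of that call's final topic.
call_end = starts['interaction_id'].map(df.groupby('interaction_id')['start_time'].max())

# Topic length = end of topic minus its own start, never crossing a call boundary.
length = next_start.fillna(call_end) - starts['start_time']

# Assumption: a topic that starts on the last row of a call has no measurable length.
length = length.where(length > 0)

# Assignment aligns on the index, so rows without a topic stay NaN.
df['topic_length'] = length.round(2)
print(df)
```

Because the shift happens inside groupby('interaction_id'), a topic is never compared with a row from a different call, which is the boundary condition the question asks for.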
