How can I work out the difference between two values in a column while staying within the bounds of another column?
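I have a DataFrame and I am trying to calculate the time difference between two different topics while staying within a single call, rather than spilling over into a new call (i.e. making sure it never computes the time difference between topics that belong to different calls). Each interaction_id is a separate call.
Here is a sample DataFrame: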
df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])
interaction_id  start_time  topic
1               2           Cost
1               5.72        NaN
1               8.83        Billing
1               12.86       NaN
2               2           Cost
2               6.75        NaN
2               8.54        NaN
3               1.5         Payments
3               3.65        Products
This is the desired output:
df2 = pd.DataFrame([[1, 2, 'Cost',6.83], [1, 5.72, NaN, NaN], [1, 8.83, 'Billing',4.03], [1, 12.86, NaN,NaN], [2, 2, 'Cost',6.54], [2, 6.75, NaN, NaN], [2, 8.54, NaN, NaN], [3, 1.5, 'Payments', 2.15],[3, 3.65, 'Products','...']], columns=['interaction_id', 'start_time', 'topic','topic_length'])
interaction_id  start_time  topic     topic_length
1               2           Cost      6.83
1               5.72        NaN       NaN
1               8.83        Billing   4.03
1               12.86       NaN       NaN
2               2           Cost      6.54
2               6.75        NaN       NaN
2               8.54        NaN       NaN
3               1.5         Payments  2.15
3               3.65        Products  ....
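I hope this makes sense.
Could you try the approach below?
I apply a function to each call (interaction), and inside it I give every topic in that call a unique number (ngroup). I then give the end of the call its own number (-1), and use diff to compute each topic's length.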
import pandas as pd
import numpy as np
from numpy import nan as NaN
df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])
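def func(df):
    # Forward-fill so every row carries the topic it belongs to.
    df = df.ffill()
    topics = df['topic'].drop_duplicates().to_list()
    # Number the topics in reverse order, so the last topic of the call gets 0.
    ngroup_df = pd.DataFrame({"topic": topics, "ngroup": list(range(len(topics)))[::-1]})
    df = df.merge(ngroup_df)
    # Temporarily mark the final row as the end of the call.
    df.loc[df.index.max(), 'ngroup'] = -1
    # Earliest start_time per group; diff() gives the gap to the next topic or to the end of the call.
    length_df = df[['start_time','ngroup']].groupby('ngroup').min().diff().dropna().rename({'start_time':'length'}, axis = 1).reset_index()
    length_df['length'] = length_df['length'].abs()
    # Put the final row back into the last topic's group.
    df.loc[df.index.max(), 'ngroup'] = 0
    return df.merge(length_df, how = 'left')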
>>> print(df.groupby(['interaction_id']).apply(func).reset_index(drop = True))
   interaction_id  start_time     topic  ngroup  length
0               1        2.00      Cost       1    6.83
1               1        5.72      Cost       1    6.83
2               1        8.83   Billing       0    4.03
3               1       12.86   Billing       0    4.03
4               2        2.00      Cost       0    6.54
5               2        6.75      Cost       0    6.54
6               2        8.54      Cost       0    6.54
7               3        1.50  Payments       1    2.15
8               3        3.65  Products       0     NaN
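For comparison, the same lengths can probably be obtained without a per-call function. The sketch below is not the method above: it swaps in a grouped shift(-1) over the topic-start rows, and it leaves the non-topic rows as NaN, as in the desired output from the question. It rests on a few assumptions that are mine rather than the asker's: rows are ordered by start_time within each call, the last row of a call marks the end of that call, and a topic that starts on the last row of its call has no measurable length and stays NaN. The column name topic_length is taken from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "Cost"], [1, 5.72, np.nan], [1, 8.83, "Billing"], [1, 12.86, np.nan],
     [2, 2, "Cost"], [2, 6.75, np.nan], [2, 8.54, np.nan],
     [3, 1.5, "Payments"], [3, 3.65, "Products"]],
    columns=["interaction_id", "start_time", "topic"],
)

is_start = df["topic"].notna()

# Among topic-start rows only, look up the start of the next topic in the
# same call. The last topic of each call gets NaN here.
starts = df[is_start]
next_start = starts.groupby("interaction_id")["start_time"].shift(-1)

# Assumption: the latest start_time in a call stands in for the end of the call.
call_end = df.groupby("interaction_id")["start_time"].transform("max")

# End of each topic: the next topic's start, else the end of the call.
topic_end = next_start.reindex(df.index).fillna(call_end)

# Keep a length only on topic-start rows that have a later timestamp in the call.
length = (topic_end - df["start_time"]).round(2)
df["topic_length"] = length.where(is_start & (df["start_time"] < call_end))

print(df)
```

Because the shift happens inside groupby("interaction_id"), a topic can never pick up a start time from a different call, which is the constraint the question asks for. If a call is not sorted by start_time, sorting with df.sort_values(["interaction_id", "start_time"]) first would be needed for this sketch to hold.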