How to calculate the average value of different pairs of rows and delete N-1 rows from a dataframe?

I have a dataframe like this:

id group person company timestamp
345 2020-04-01 user1 A 10:04:05
346 2020-04-01 user1 A 10:14:05
347 2020-04-01 user2 B 10:24:05
348 2020-04-01 user1 A 11:04:05
349 2020-04-01 user2 B 11:06:05
... ... ... ... ...
1000 2020-04-20 user1 AA 11:04:05
1034 2020-04-20 user1 AA 12:04:05
1078 2020-04-21 user2 BB 12:34:05
1200 2020-04-22 user1 AA 12:40:05

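This is a list of messages, where user1 is a consultant and userN are clients from different companies. I also added a group column that holds the date on which each message was sent.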

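I need to calculate the average time between messages from different types of users, i.e.: on 2020-04-01 **user1** sent the 1st message at 10:04:05 and **user2** answered at 10:24:05, a difference of 20 min; on the same day user1 sent the 2nd message at 11:04:05 and user2 answered at 11:06:05, a difference of 2 min. Knowing several such differences, I can calculate mean(). If a day only has messages from one type of user, the average should be 'no answer'.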

My code is here:

fin = fin.reset_index() # reset indexes
# here I wanna leave only the first message of each type of users, convert [user1, user1, user2] to [user1, user2]
test = fin.loc[fin['sender_full_name'].shift() != fin['sender_full_name']]
g = test.groupby('group') # got the series of group
for i in g.groups: # iterate over every group element
    id = g.get_group(i).index # got the index of this group
    f = test.loc[id] # new dataframe by index
    ds = pd.Series(f['timestamp']).reset_index(drop=True) # got all timestamps by date 
    
    avg_idx = pd.Series(f['id'])
    s1 = pd.Series([])
    s2 = pd.Series([])
    for j in range(ds.size):
        s1 = s1.append([pd.Series(ds[j])], ignore_index=True) if j % 2 == 0 else s1
        s2 = s2.append([pd.Series(ds[j])], ignore_index=True) if j % 2 != 0 else s2
        s3 = s2.subtract(s1) if len(s2) > 0 else 'без ответа'
        s3 = s3.loc[~s3.isna()].mean() if len(s2) > 0 else s3
        fin.loc[fin['id'].isin(avg_idx), 'avg'] = s3 # write new value of average
fin

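But I don't get the expected values. After that, I want to delete every row in a run of messages from the same user except the first one, i.e. go

from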

id group person company timestamp
1000 2020-04-20 user1 AA 11:04:05
1034 2020-04-20 user1 AA 12:04:05
1078 2020-04-21 user2 BB 12:34:05
1200 2020-04-22 user1 AA 12:40:05

to

id group person company timestamp
1000 2020-04-20 user1 AA 11:04:05
1078 2020-04-21 user2 BB 12:34:05
1200 2020-04-22 user1 AA 12:40:05

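Let's define a monologue as a series of consecutive messages from the same person. Here is how you can get the time difference between the start of each client monologue and the start of the consultant's most recent preceding monologue.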

import pandas as pd

df = pd.DataFrame({
    "group": [
        "2020-04-01", "2020-04-01", "2020-04-01", "2020-04-01",
        "2020-04-01", "2020-04-20", "2020-04-20", "2020-04-21",
        "2020-04-22"
    ],
    "person": [
        "user1", "user1", "user2", "user1", "user2", "user1",
        "user1", "user2", "user1"
    ],
    "time": [
        "10:04:05", "10:14:05", "10:24:05", "11:04:05", "11:06:05",
        "11:04:05", "12:04:05", "12:34:05", "12:40:05"
    ],
    "company": ["A", "A", "B", "A", "B", "AA", "AA", "BB", "AA"],
})

# Only keep the first message of each monologue
# (.copy() so that adding columns below does not trigger SettingWithCopyWarning)
df = df[df["person"] != df["person"].shift()].copy()

# Add a timestamp column for time difference computations
df["timestamp"] = pd.to_datetime(df["group"] + " " + df["time"])

# Keep time when user is user1, NaN otherwise
person_is_user1 = df["person"] == "user1"
user1_time = df["timestamp"].where(person_is_user1)

# Then replace NaNs with the closest earlier non-NaN value
# (.ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas)
last_user1_time = user1_time.ffill()

# Then exclude rows where user is user1
last_user1_time = last_user1_time.where(~person_is_user1)
df["diff"] = df["timestamp"] - last_user1_time

Result:

        group person      time company           timestamp            diff
0  2020-04-01  user1  10:04:05       A 2020-04-01 10:04:05             NaT
2  2020-04-01  user2  10:24:05       B 2020-04-01 10:24:05 0 days 00:20:00
3  2020-04-01  user1  11:04:05       A 2020-04-01 11:04:05             NaT
4  2020-04-01  user2  11:06:05       B 2020-04-01 11:06:05 0 days 00:02:00
5  2020-04-20  user1  11:04:05      AA 2020-04-20 11:04:05             NaT
7  2020-04-21  user2  12:34:05      BB 2020-04-21 12:34:05 1 days 01:30:00
8  2020-04-22  user1  12:40:05      AA 2020-04-22 12:40:05             NaT
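
Note that the row for 2020-04-21 is matched with a consultant message from the previous day, because the forward fill runs over the whole frame. If a reply should only count against a consultant message sent on the same day, one possible variation is to forward-fill within each group. This is a minimal sketch under that assumption; the column name diff_same_day is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "group": [
        "2020-04-01", "2020-04-01", "2020-04-01", "2020-04-01",
        "2020-04-01", "2020-04-20", "2020-04-20", "2020-04-21",
        "2020-04-22"
    ],
    "person": [
        "user1", "user1", "user2", "user1", "user2", "user1",
        "user1", "user2", "user1"
    ],
    "time": [
        "10:04:05", "10:14:05", "10:24:05", "11:04:05", "11:06:05",
        "11:04:05", "12:04:05", "12:34:05", "12:40:05"
    ],
})

# Same preparation as above: first message of each monologue, plus a timestamp
df = df[df["person"] != df["person"].shift()].copy()
df["timestamp"] = pd.to_datetime(df["group"] + " " + df["time"])

person_is_user1 = df["person"] == "user1"
user1_time = df["timestamp"].where(person_is_user1)

# Forward-fill inside each day only, so a consultant message from an
# earlier day is never carried over to a later day
last_user1_time_same_day = user1_time.groupby(df["group"]).ffill()

# Keep the difference on client rows only (hypothetical column name)
df["diff_same_day"] = (df["timestamp"] - last_user1_time_same_day).where(~person_is_user1)
```

With this variation the 2020-04-21 reply has no same-day consultant message, so its difference stays NaT and it no longer contributes to the mean.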

You can then call df["diff"].mean() to get the average difference:

>>> df["diff"].mean()
Timedelta('0 days 08:37:20')
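
The question also asks for one average per day, with 'no answer' for days where only one type of user wrote. A rough sketch of one way to do that follows. It assumes a frame shaped like the result above (rebuilt here by hand so the snippet runs on its own); the names daily_mean and daily_report are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for the result above: one row per monologue start,
# with "diff" set on client rows and NaT on consultant rows
df = pd.DataFrame({
    "group": [
        "2020-04-01", "2020-04-01", "2020-04-01", "2020-04-01",
        "2020-04-20", "2020-04-21", "2020-04-22"
    ],
    "person": [
        "user1", "user2", "user1", "user2", "user1", "user2", "user1"
    ],
    "diff": pd.Series([
        pd.NaT, pd.Timedelta(minutes=20), pd.NaT, pd.Timedelta(minutes=2),
        pd.NaT, pd.Timedelta(days=1, hours=1, minutes=30), pd.NaT
    ], dtype="timedelta64[ns]"),
})

# Mean reply time per day; days without any reply come out as NaT
daily_mean = df.groupby("group")["diff"].mean()

# Switch to object dtype so a text label can sit next to Timedelta values
daily_report = daily_mean.astype(object).where(daily_mean.notna(), "no answer")
print(daily_report)
```

Mixing strings and Timedelta values gives an object column, which is fine for display but awkward for further arithmetic, so it may be better to keep daily_mean around and only apply the label when reporting.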