How to calculate the average value of different pairs of rows and delete N-1 rows from a dataframe?

Question

I have a dataframe like this:

id	group	person	company	time
345	2020-04-01	user1	A	10:04:05
346	2020-04-01	user1	A	10:14:05
347	2020-04-01	user2	B	10:24:05
348	2020-04-01	user1	A	11:04:05
349	2020-04-01	user2	B	11:06:05
...	...	...	...	...
1000	2020-04-20	user1	AA	11:04:05
1034	2020-04-20	user1	AA	12:04:05
1078	2020-04-21	user2	BB	12:34:05
1200	2020-04-22	user1	AA	12:40:05

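This is a list of messages, where user1 is a consultant and userN are customers from different companies. I also added a group column that holds the date on which each message was sent.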

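I need to compute the average time between messages from different types of users. For example, on 2020-04-01 **user1** sent the 1st message at 10:04:05 and **user2** answered at 10:24:05, a difference of 20 min. On the same day user1 sent the 2nd message at 11:04:05 and user2 answered at 11:06:05, a difference of 2 min. Knowing these differences, I can compute mean(). If a day only has messages from one type of user, the average should be 'no answered'.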

Here is my code:

fin = fin.reset_index() # reset indexes
# here I wanna leave only the first message of each type of users, convert [user1, user1, user2] to [user1, user2]
test = fin.loc[fin['sender_full_name'].shift() != fin['sender_full_name']]
g = test.groupby('group') # got the series of group
for i in g.groups: # iterate over every group element
    id = g.get_group(i).index # got the index of this group
    f = test.loc[id] # new dataframe by index
    ds = pd.Series(f['timestamp']).reset_index(drop=True) # got all timestamps by date 
    
    avg_idx = pd.Series(f['id'])
    s1 = pd.Series([])
    s2 = pd.Series([])
    for j in range(ds.size):
        s1 = s1.append([pd.Series(ds[j])], ignore_index=True) if j % 2 == 0 else s1
        s2 = s2.append([pd.Series(ds[j])], ignore_index=True) if j % 2 != 0 else s2
        s3 = s2.subtract(s1) if len(s2) > 0 else 'без ответа'
        s3 = s3.loc[~s3.isna()].mean() if len(s2) > 0 else s3
        fin.loc[fin['id'].isin(avg_idx), 'avg'] = s3 # write new value of average
fin

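However, I don't get the expected values. After that, I also want to delete every consecutive row from the same person in a group except the first one, i.e. go

from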

id	group	person	company	timestamp
1000	2020-04-20	user1	AA	11:04:05
1034	2020-04-20	user1	AA	12:04:05
1078	2020-04-21	user2	BB	12:34:05
1200	2020-04-22	user1	AA	12:40:05

to

id	group	person	company	timestamp
1000	2020-04-20	user1	AA	11:04:05
1078	2020-04-21	user2	BB	12:34:05
1200	2020-04-22	user1	AA	12:40:05

Answer 1

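Let's define a monologue as a series of consecutive messages from the same person. Here is how you can get the time difference between the start of each customer monologue and the start of the consultant's most recent preceding monologue.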

import pandas as pd

df = pd.DataFrame({
    "group": [
        "2020-04-01", "2020-04-01", "2020-04-01", "2020-04-01",
        "2020-04-01", "2020-04-20", "2020-04-20", "2020-04-21",
        "2020-04-22"
    ],
    "person": [
        "user1", "user1", "user2", "user1", "user2", "user1",
        "user1", "user2", "user1"
    ],
    "time": [
        "10:04:05", "10:14:05", "10:24:05", "11:04:05", "11:06:05",
        "11:04:05", "12:04:05", "12:34:05", "12:40:05"
    ],
    "company": ["A", "A", "B", "A", "B", "AA", "AA", "BB", "AA"],
})

# Only keep the first message of each monologue
# (.copy() avoids a SettingWithCopyWarning when adding columns below)
df = df[df["person"] != df["person"].shift()].copy()

# Add a timestamp column for time difference computations
df["timestamp"] = pd.to_datetime(df["group"] + " " + df["time"])

# Keep time when user is user1, NaN otherwise
person_is_user1 = df["person"] == "user1"
user1_time = df["timestamp"].where(person_is_user1)

# Then replace missing values with the closest earlier non-missing value
# (fillna(method="ffill") is deprecated in recent pandas; use ffill())
last_user1_time = user1_time.ffill()

# Then exclude rows where user is user1
last_user1_time = last_user1_time.where(~person_is_user1)
df["diff"] = df["timestamp"] - last_user1_time

Result:

        group person      time company           timestamp            diff
0  2020-04-01  user1  10:04:05       A 2020-04-01 10:04:05             NaT
2  2020-04-01  user2  10:24:05       B 2020-04-01 10:24:05 0 days 00:20:00
3  2020-04-01  user1  11:04:05       A 2020-04-01 11:04:05             NaT
4  2020-04-01  user2  11:06:05       B 2020-04-01 11:06:05 0 days 00:02:00
5  2020-04-20  user1  11:04:05      AA 2020-04-20 11:04:05             NaT
7  2020-04-21  user2  12:34:05      BB 2020-04-21 12:34:05 1 days 01:30:00
8  2020-04-22  user1  12:40:05      AA 2020-04-22 12:40:05             NaT

Then you can call df["diff"].mean() to get the average difference over all days:

>>> df["diff"].mean()
Timedelta('0 days 08:37:20')
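
The question also asks for one average per day, with a placeholder when a day has no reply. A minimal sketch of one way to do that is below. It assumes that a reply only counts when it falls on the same day as the consultant's message, so the forward fill is restricted to each group. The names diff_s, avg_s and avg_label are introduced here for illustration, and the label text "no answer" is a placeholder.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["2020-04-01"] * 5 + ["2020-04-20", "2020-04-20",
                                   "2020-04-21", "2020-04-22"],
    "person": ["user1", "user1", "user2", "user1", "user2",
               "user1", "user1", "user2", "user1"],
    "time": ["10:04:05", "10:14:05", "10:24:05", "11:04:05", "11:06:05",
             "11:04:05", "12:04:05", "12:34:05", "12:40:05"],
})

# Keep only the first message of each monologue
df = df[df["person"] != df["person"].shift()].copy()
df["timestamp"] = pd.to_datetime(df["group"] + " " + df["time"])

is_user1 = df["person"] == "user1"

# Start of the latest consultant monologue, carried forward
# within each day only (assumption: replies must be same-day)
last_user1_time = (
    df["timestamp"].where(is_user1).groupby(df["group"]).ffill()
)

# Reply delay in seconds, defined only on customer rows
diff = (df["timestamp"] - last_user1_time).where(~is_user1)
df["diff_s"] = diff.dt.total_seconds()

# Mean delay per day; NaN when the day has no customer reply
avg_s = df.groupby("group")["diff_s"].mean()

# Human-readable labels, with a placeholder for unanswered days
avg_label = {
    day: "no answer" if pd.isna(s) else str(pd.Timedelta(seconds=s))
    for day, s in avg_s.items()
}
print(avg_label)
```

With this same-day rule, the reply on 2020-04-21 is not matched to the message from 2020-04-20, so both of those days end up labeled as unanswered. If replies across days should count, drop the groupby before ffill, as in the code above.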

