How to calculate the average value of different pairs of rows and delete N-1 rows from a dataframe?
I have a dataframe like this:

id | group | person | company | time | timestamp |
---|---|---|---|---|---|
345 | 2020-04-01 | user1 | A | 10:04:05 | |
346 | 2020-04-01 | user1 | A | 10:14:05 | |
347 | 2020-04-01 | user2 | B | 10:24:05 | |
348 | 2020-04-01 | user1 | A | 11:04:05 | |
349 | 2020-04-01 | user2 | B | 11:06:05 | |
... | ... | ... | ... | ... | |
1000 | 2020-04-20 | user1 | AA | 11:04:05 | |
1034 | 2020-04-20 | user1 | AA | 12:04:05 | |
1078 | 2020-04-21 | user2 | BB | 12:34:05 | |
1200 | 2020-04-22 | user1 | AA | 12:40:05 | |
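This is a list of messages, where user1 is the consultant and userN are clients from different companies. I also added a group column that holds the date each message was sent.
I need to calculate the average time between messages from different types of users, i.e.:
on 2020-04-01 **user1** sent the 1st message at 10:04:05 and **user2** answered at 10:24:05, a difference of 20 min;
on the same day user1 sent the 2nd message at 11:04:05 and user2 answered at 11:06:05, a difference of 2 min.
Knowing several such differences, I can calculate mean().
If a group only has messages from one type of user, its average should be 'no answered'.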
My code is here:

    fin = fin.reset_index()  # reset indexes
    # here I wanna leave only the first message of each type of users, convert [user1, user1, user2] to [user1, user2]
    test = fin.loc[fin['sender_full_name'].shift() != fin['sender_full_name']]
    g = test.groupby('group')  # got the series of group
    for i in g.groups:  # iterate over every group element
        id = g.get_group(i).index  # got the index of this group
        f = test.loc[id]  # new dataframe by index
        ds = pd.Series(f['timestamp']).reset_index(drop=True)  # got all timestamps by date
        avg_idx = pd.Series(f['id'])
        s1 = pd.Series([])
        s2 = pd.Series([])
        for j in range(ds.size):
            s1 = s1.append([pd.Series(ds[j])], ignore_index=True) if j % 2 == 0 else s1
            s2 = s2.append([pd.Series(ds[j])], ignore_index=True) if j % 2 != 0 else s2
        s3 = s2.subtract(s1) if len(s2) > 0 else 'без ответа'  # Russian for "no answer"
        s3 = s3.loc[~s3.isna()].mean() if len(s2) > 0 else s3
        fin.loc[fin['id'].isin(avg_idx), 'avg'] = s3  # write new value of average
    fin
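But I don't get the expected values. After that, I want to delete every row in each group except the first one, i.e. go from

id | group | person | company | timestamp |
---|---|---|---|---|
1000 | 2020-04-20 | user1 | AA | 11:04:05 |
1034 | 2020-04-20 | user1 | AA | 12:04:05 |
1078 | 2020-04-21 | user2 | BB | 12:34:05 |
1200 | 2020-04-22 | user1 | AA | 12:40:05 |

to

id | group | person | company | timestamp |
---|---|---|---|---|
1000 | 2020-04-20 | user1 | AA | 11:04:05 |
1078 | 2020-04-21 | user2 | BB | 12:34:05 |
1200 | 2020-04-22 | user1 | AA | 12:40:05 |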
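Let's define a monologue as a series of consecutive messages from the same person. Here is how you can get the time difference between the start of each client monologue and the start of the most recent consultant monologue before it.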

    import pandas as pd

    df = pd.DataFrame({
        "group": [
            "2020-04-01", "2020-04-01", "2020-04-01", "2020-04-01",
            "2020-04-01", "2020-04-20", "2020-04-20", "2020-04-21",
            "2020-04-22"
        ],
        "person": [
            "user1", "user1", "user2", "user1", "user2", "user1",
            "user1", "user2", "user1"
        ],
        "time": [
            "10:04:05", "10:14:05", "10:24:05", "11:04:05", "11:06:05",
            "11:04:05", "12:04:05", "12:34:05", "12:40:05"
        ],
        "company": ["A", "A", "B", "A", "B", "AA", "AA", "BB", "AA"],
    })

    # Only keep the first message of each monologue
    # (.copy() avoids a SettingWithCopyWarning when adding columns below)
    df = df[df["person"] != df["person"].shift()].copy()

    # Add a timestamp column for time difference computations
    df["timestamp"] = pd.to_datetime(df["group"] + " " + df["time"])

    # Keep the time when the person is user1, NaT otherwise
    person_is_user1 = df["person"] == "user1"
    user1_time = df["timestamp"].where(person_is_user1)

    # Then replace each NaT with the closest earlier non-NaT value
    # (.ffill() replaces the deprecated fillna(method="ffill"))
    last_user1_time = user1_time.ffill()

    # Then blank out the rows where the person is user1
    last_user1_time = last_user1_time.where(~person_is_user1)

    df["diff"] = df["timestamp"] - last_user1_time

Result:

            group person      time company           timestamp            diff
    0  2020-04-01  user1  10:04:05       A 2020-04-01 10:04:05             NaT
    2  2020-04-01  user2  10:24:05       B 2020-04-01 10:24:05 0 days 00:20:00
    3  2020-04-01  user1  11:04:05       A 2020-04-01 11:04:05             NaT
    4  2020-04-01  user2  11:06:05       B 2020-04-01 11:06:05 0 days 00:02:00
    5  2020-04-20  user1  11:04:05      AA 2020-04-20 11:04:05             NaT
    7  2020-04-21  user2  12:34:05      BB 2020-04-21 12:34:05 1 days 01:30:00
    8  2020-04-22  user1  12:40:05      AA 2020-04-22 12:40:05             NaT

You can then call df["diff"].mean() to get the average difference (NaT entries are skipped):

    >>> df["diff"].mean()
    Timedelta('0 days 08:37:20')
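
The question also asks for one average per `group` (day), a placeholder when a day has messages from only one type of user, and a single remaining row per group. The following is a minimal sketch of one way to do that, not a definitive solution. It assumes that a reply should only be matched against a consultant message from the same day, which is one reading of the question and differs from the global computation above. The names `starts`, `mean_or_placeholder`, `avg_per_group`, `summary` and the `avg` column are illustrative; the `id` values and the 'no answered' label are taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [345, 346, 347, 348, 349, 1000, 1034, 1078, 1200],
    "group": [
        "2020-04-01", "2020-04-01", "2020-04-01", "2020-04-01",
        "2020-04-01", "2020-04-20", "2020-04-20", "2020-04-21",
        "2020-04-22",
    ],
    "person": [
        "user1", "user1", "user2", "user1", "user2", "user1",
        "user1", "user2", "user1",
    ],
    "time": [
        "10:04:05", "10:14:05", "10:24:05", "11:04:05", "11:06:05",
        "11:04:05", "12:04:05", "12:34:05", "12:40:05",
    ],
})
df["timestamp"] = pd.to_datetime(df["group"] + " " + df["time"])

# Monologue starts only: drop consecutive messages from the same person
starts = df[df["person"] != df["person"].shift()].copy()

# Start time of the latest consultant (user1) monologue,
# forward-filled within each day only (assumption: no cross-day matching)
is_consultant = starts["person"] == "user1"
consultant_time = starts["timestamp"].where(is_consultant)
last_consultant = consultant_time.groupby(starts["group"]).ffill()
starts["diff"] = starts["timestamp"] - last_consultant.where(~is_consultant)


def mean_or_placeholder(diffs):
    """Illustrative helper: mean reply time, or the question's placeholder."""
    diffs = diffs.dropna()
    return diffs.mean() if len(diffs) else "no answered"


# One value per day: a Timedelta, or 'no answered' if nobody replied that day
avg_per_group = pd.Series(
    {day: mean_or_placeholder(d) for day, d in starts.groupby("group")["diff"]}
)

# Delete the other N-1 rows of each group and attach the group's average
summary = df.drop_duplicates(subset="group", keep="first").copy()
summary["avg"] = summary["group"].map(avg_per_group)
print(summary[["id", "group", "person", "avg"]])
```

On the sample data this keeps the rows with ids 345, 1000, 1078 and 1200. The average for 2020-04-01 is 11 minutes (the mean of 20 min and 2 min), and the other three days get 'no answered' because each of them contains messages from only one type of user. If replies should also count across days, replace the per-group `ffill()` with a plain `consultant_time.ffill()`.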