Python - Count consecutive frequencies by group
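I have a series of emails that are ordered by timestamp and user_id.
I want to investigate how often email j follows email i. I will display these frequencies across users in a heatmap to show the most common paths.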
a = """timestamp,email,subject
2016-07-01 10:17:00,a@gmail.com,subject2
2016-07-01 02:01:02,a@gmail.com,welcome
2016-07-01 14:45:04,a@gmail.com,subject3
2016-07-01 08:14:02,a@gmail.com,subject1
2016-07-01 16:26:35,a@gmail.com,subject4
2016-07-01 10:17:00,b@gmail.com,subject1
2016-07-01 02:01:02,b@gmail.com,welcome
2016-07-01 14:45:04,b@gmail.com,subject3
2016-07-01 08:14:02,b@gmail.com,subject2
2016-07-01 16:26:35,b@gmail.com,subject4
2016-07-01 18:00:00,c@gmail.com,welcome
2016-07-01 19:00:02,c@gmail.com,subject1
2016-07-01 20:00:04,c@gmail.com,subject3
2016-07-01 21:14:02,c@gmail.com,subject4
2016-07-01 21:26:35,c@gmail.com,subject2
"""
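import pandas as pd
from io import StringIO  # StringIO lives in the standard library, not in pandas.io.parsers

# Parse the CSV text, then sort each user's emails chronologically
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df1 = df1.sort_values(['email', 'timestamp'])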
Sorted df1:
             timestamp        email   subject
1  2016-07-01 02:01:02  a@gmail.com   welcome
3  2016-07-01 08:14:02  a@gmail.com  subject1
0  2016-07-01 10:17:00  a@gmail.com  subject2
2  2016-07-01 14:45:04  a@gmail.com  subject3
4  2016-07-01 16:26:35  a@gmail.com  subject4
6  2016-07-01 02:01:02  b@gmail.com   welcome
8  2016-07-01 08:14:02  b@gmail.com  subject2
5  2016-07-01 10:17:00  b@gmail.com  subject1
7  2016-07-01 14:45:04  b@gmail.com  subject3
9  2016-07-01 16:26:35  b@gmail.com  subject4
10 2016-07-01 18:00:00  c@gmail.com   welcome
11 2016-07-01 19:00:02  c@gmail.com  subject1
12 2016-07-01 20:00:04  c@gmail.com  subject3
13 2016-07-01 21:14:02  c@gmail.com  subject4
14 2016-07-01 21:26:35  c@gmail.com  subject2
The output should look like this (columns are the current email, rows are the email that follows it):
          welcome  subject1  subject2  subject3  subject4
welcome         0
subject1        2         0
subject2        1         1         0
subject3        0         2         1         0
subject4        0         0         0         3         0
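In other words, subject1 followed the welcome email 2 times, subject2 followed the welcome email 1 time, and so on.
What is the best way to do this?
Two lines (which you could squeeze into one):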
df1['next_subject'] = df1.groupby('email')['subject'].shift(-1)
res = pd.crosstab(df1['next_subject'], df1['subject'])
print(res)
# subject       subject1  subject2  subject3  subject4  welcome
# next_subject
# subject1             0         1         0         0        2
# subject2             1         0         0         1        1
# subject3             2         1         0         0        0
# subject4             0         0         3         0        0
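To see why the two lines above work, it may help to look at what the grouped shift produces on a tiny frame. The following is only a sketch on made-up data (the `toy` frame and its values are assumptions for illustration, not the data from the question): `shift(-1)` inside each `email` group pulls the following row's subject up by one position, and the last email of each user is left with NaN, which `pd.crosstab` then leaves out of the counts.

```python
import pandas as pd

# Hypothetical toy data, already sorted by email and then by time
toy = pd.DataFrame({
    'email':   ['x', 'x', 'x', 'y', 'y'],
    'subject': ['welcome', 'subject1', 'subject2', 'welcome', 'subject2'],
})

# Within each user, pull the following row's subject up by one position
toy['next_subject'] = toy.groupby('email')['subject'].shift(-1)

# The last email of each user has no successor, so it is left as NaN
last_rows = toy[toy['next_subject'].isna()]

# Rows with a NaN successor are not counted, so only real transitions remain
counts = pd.crosstab(toy['next_subject'], toy['subject'])

print(toy)
print(counts)
```

Because the shift is done per group, the last email of one user is never paired with the first email of the next user, which is what a plain `df['subject'].shift(-1)` would do.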
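You can tweak it a little to get it into exactly the form you quoted in the OP:
subjects = ['welcome'] + ['subject{}'.format(i) for i in range(1, 5)]
# 'welcome' never appears as a next_subject, so res has no such row;
# reindex adds it filled with 0 (.loc with a missing label raises KeyError in recent pandas)
res = res.reindex(index=subjects, columns=subjects, fill_value=0)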
print(res)
# subject       welcome  subject1  subject2  subject3  subject4
# next_subject
# welcome             0         0         0         0         0
# subject1            2         0         1         0         0
# subject2            1         1         0         0         1
# subject3            0         2         1         0         0
# subject4            0         0         0         3         0
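Since the stated goal is a heatmap, relative frequencies may be easier to compare than raw counts. As a rough sketch, `pd.crosstab` accepts `normalize='columns'`, which divides each column by its total so that every cell becomes the share of times a given `subject` was followed by that `next_subject`. The `toy` frame below is hypothetical and stands in for the sorted data; plotting the result is left out here.

```python
import pandas as pd

# Hypothetical toy data, already sorted by email and then by time
toy = pd.DataFrame({
    'email':   ['x', 'x', 'x', 'y', 'y', 'y'],
    'subject': ['welcome', 'subject1', 'subject2',
                'welcome', 'subject2', 'subject1'],
})
toy['next_subject'] = toy.groupby('email')['subject'].shift(-1)

# normalize='columns' divides each column by its total, so every column
# becomes the distribution of what followed that subject
shares = pd.crosstab(toy['next_subject'], toy['subject'], normalize='columns')

print(shares)
```

Each column of `shares` sums to 1, so the table can be read as "given this email, how often did each other email come next".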