Calculate a case-when count in a group by using Pandas
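I'm pretty new to Python, and I'm trying to calculate an open rate (the ratio between two distinct counts) in a single line of code.
My dataframe looks like this: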
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])
I want to calculate the open rate with Pandas while grouping by the column Tool_used.
In SQL it would look like this:
select
    Tool_used,
    count(distinct case when event = 'open' then sent_mail_id end)
    / count(distinct case when event = 'delivered' then sent_mail_id end)
from df
group by 1
Note that sent_mail_id must be counted distinctly, because I need a unique count.
Thanks
See if this is what you need; it gives an open rate ratio column for each group:
df1 = ((df.loc[df['event'] == 'open'].groupby('Tool_used')['event'].count()
/
df.loc[df['event'] == 'delivered'].groupby('Tool_used')['event'].count())
.to_frame(name='open rate ratio')
).reset_index()
Result:
print(df1)
  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
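The question asks for a distinct count of sent_mail_id, while count() on the event column counts every row. A minimal sketch of one way to adapt the snippet above, assuming the same sample df as in the question and swapping count() for nunique() on sent_mail_id (the variable names opened and delivered are only illustrative):

```python
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

# Distinct sent_mail_id values per Tool_used, for each event type
opened = df.loc[df['event'] == 'open'].groupby('Tool_used')['sent_mail_id'].nunique()
delivered = df.loc[df['event'] == 'delivered'].groupby('Tool_used')['sent_mail_id'].nunique()

# Division aligns on the Tool_used index
df1 = (opened / delivered).to_frame(name='open rate ratio').reset_index()
print(df1)
```

On this sample data Mobile comes out as 2.0 instead of 3.0, because mail 142 was opened twice on Mobile and is now counted once. Web stays at 0.5.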
Use crosstab, so then you only need to divide the open column by the delivered column, followed by Series.reset_index:
df1 = pd.crosstab(df['Tool_used'], df['event'])
print (df1)
event      delivered  open
Tool_used
Mobile             1     3
Web                2     1
df2 = df1['open'].div(df1['delivered']).reset_index(name='open rate ratio')
print (df2)
  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
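crosstab counts rows, so repeated opens of the same mail are each counted. If a distinct count of sent_mail_id is wanted, as in the SQL from the question, one possible sketch is to drop duplicate (Tool_used, event, sent_mail_id) combinations first. This assumes the same sample df; unique_events and counts are illustrative names:

```python
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

# Keep one row per (Tool_used, event, sent_mail_id) so each mail counts once
unique_events = df.drop_duplicates(subset=['Tool_used', 'event', 'sent_mail_id'])

# Row counts on the de-duplicated frame are now distinct mail counts
counts = pd.crosstab(unique_events['Tool_used'], unique_events['event'])

df2 = counts['open'].div(counts['delivered']).reset_index(name='open rate ratio')
print(df2)
```

With the sample data the Mobile open count drops from 3 to 2, so the ratio becomes 2.0 for Mobile and remains 0.5 for Web.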
If you need groupby, compare and aggregate with sum, but I think this is more complicated:
a = (df['event'] == 'open').groupby(df['Tool_used']).sum()
b = (df['event'] == 'delivered').groupby(df['Tool_used']).sum()
df2 = a.div(b).reset_index(name='open rate ratio')
print (df2)
  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
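Summing a boolean mask counts matching rows, not distinct mails. A hedged variant of the same idea, assuming the sample df from the question, combines each mask with a "first occurrence" mask built from DataFrame.duplicated, so a repeated (Tool_used, event, sent_mail_id) row contributes only once. The name first_seen is illustrative:

```python
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

# True only for the first row of each (Tool_used, event, sent_mail_id) combination
first_seen = ~df.duplicated(subset=['Tool_used', 'event', 'sent_mail_id'])

# Sum of booleans per group = number of distinct mails with that event
a = ((df['event'] == 'open') & first_seen).groupby(df['Tool_used']).sum()
b = ((df['event'] == 'delivered') & first_seen).groupby(df['Tool_used']).sum()

df2 = a.div(b).reset_index(name='open rate ratio')
print(df2)
```

Because the masks cover every row, a Tool_used value with no open events still appears in the result with a count of 0 rather than being missing.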
A solution with a custom function (slower on large data):
def f(x):
return (x == 'open').sum() / (x == 'delivered').sum()
df2 = df.groupby('Tool_used')['event'].agg(f).reset_index(name='open rate ratio')
print (df2)
  Tool_used  open rate ratio
0    Mobile              3.0
1       Web              0.5
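The custom function above only sees the event column, so it cannot count distinct mails. A sketch of a distinct-count version, assuming the same sample df, passes both event and sent_mail_id to the function through groupby(...).apply. The function name distinct_open_rate is hypothetical, and the same performance caveat applies:

```python
import pandas as pd

df = pd.DataFrame([
    (142, 1, 'open', 'Mobile'),
    (144, 2, 'open', 'Mobile'),
    (144, 1, 'delivered', 'Web'),
    (142, 1, 'delivered', 'Mobile'),
    (142, 2, 'delivered', 'Web'),
    (144, 1, 'open', 'Web'),
    (142, 2, 'open', 'Mobile')
], columns=['sent_mail_id', 'customer_id', 'event', 'Tool_used'])

def distinct_open_rate(group):
    # group is the sub-DataFrame for one Tool_used value
    opened = group.loc[group['event'] == 'open', 'sent_mail_id'].nunique()
    delivered = group.loc[group['event'] == 'delivered', 'sent_mail_id'].nunique()
    # Note: raises ZeroDivisionError if a group has no delivered mails
    return opened / delivered

df2 = (df.groupby('Tool_used')[['event', 'sent_mail_id']]
         .apply(distinct_open_rate)
         .reset_index(name='open rate ratio'))
print(df2)
```

For the sample data this gives 2.0 for Mobile and 0.5 for Web, matching what the SQL with count(distinct ...) would compute.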