Pandas/Python: analyse a frequency histogram by one column and aggregate another column into those buckets
I am using pandas with a data frame like the following:

Name | percent | Amount |
---|---|---|
A | 3 | 34 |
B | 5 | 200 |
C | 30 | 20 |
D | 1 | 12 |
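I want to create buckets for the `percent` column, for example 0-5, 6-15, and >15. Using these buckets, I want to record the count of the `percent` column (which is effectively a histogram) and also the average of `Amount` within each bucket.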
Using the example above:

Bucket | percent count | Avg. Amount |
---|---|---|
5 | 3 | 82 |
15 | 0 | 0 |
>15 | 1 | 20 |
How can I achieve this in Python with pandas (or any other library)?
You need to use `pandas.cut` and `groupby` + `agg`:
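    import pandas as pd

    (df.assign(Bucket=pd.cut(df['percent'],
                             # bins are right-inclusive: (0, 5], (5, 15], (15, inf]
                             [0, 5, 15, float('inf')],
                             labels=['0-5', '6-15', '>15']))
       # observed=False keeps empty buckets such as 6-15 in the result
       .groupby('Bucket', observed=False)
       .agg(**{'percent count': ('percent', 'count'),
               'Avg. Amount': ('Amount', 'mean')})
       # empty buckets have a NaN mean; `downcast` is deprecated in recent pandas releases
       .fillna(0, downcast='infer')
       .reset_index()
    )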
Output:

      Bucket  percent count  Avg. Amount
    0    0-5              3           82
    1   6-15              0            0
    2    >15              1           20
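For readers who want to run the `pandas.cut` approach end to end, here is a minimal sketch that rebuilds the question's data frame and wraps the chain in a helper. The name `bucket_summary` and its parameters are hypothetical, and `observed=False` is passed explicitly on the assumption that empty buckets should stay in the result regardless of the pandas version's default.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "percent": [3, 5, 30, 1],
    "Amount": [34, 200, 20, 12],
})


def bucket_summary(frame, edges, labels):
    """Count `percent` and average `Amount` per bucket (hypothetical helper)."""
    # pd.cut uses right-inclusive intervals by default: (0, 5], (5, 15], ...
    buckets = pd.cut(frame["percent"], bins=edges, labels=labels)
    return (
        frame.assign(Bucket=buckets)
        # observed=False keeps categories that contain no rows.
        .groupby("Bucket", observed=False)
        .agg(**{"percent count": ("percent", "count"),
                "Avg. Amount": ("Amount", "mean")})
        # Empty buckets have a NaN mean; report them as 0 instead.
        .fillna(0)
        .reset_index()
    )


summary = bucket_summary(df, [0, 5, 15, float("inf")], ["0-5", "6-15", ">15"])
print(summary)
```

Because the lowest edge is exclusive, a `percent` of exactly 0 would fall outside every bucket; passing `include_lowest=True` to `pd.cut` is one way to include it.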
Using `numpy.select`, `Series.between` and `GroupBy.agg`:
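    In [232]: import numpy as np
    In [233]: conds = [df['percent'].between(0, 5), df['percent'].between(6, 15), df['percent'].gt(15)]
    In [234]: choices = ['5', '15', '>15']
    # A string default is passed because newer NumPy versions may reject mixing the integer default 0 with string choices.
    In [237]: df['Bucket'] = np.select(conds, choices, default='')
    # reindex(choices) restores empty buckets; astype(int) truncates any fractional mean.
    In [245]: res = df.groupby('Bucket').agg({'percent': 'count', 'Amount': 'mean'}).reindex(choices).fillna(0).astype(int).reset_index()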
    In [246]: res
    Out[246]:
      Bucket  percent  Amount
    0      5        3      82
    1     15        0       0
    2    >15        1      20
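One difference between the two answers is worth seeing on a small input: the `between(0, 5)` / `between(6, 15)` conditions leave a gap for non-integer values such as 5.5, while the `pd.cut` bins are contiguous. The sketch below uses an assumed sample series and a hypothetical `"unassigned"` default label to make that visible.

```python
import numpy as np
import pandas as pd

# Assumed sample values; 5.5 sits between the two `between` ranges.
percent = pd.Series([3, 5, 5.5, 30, 1])

# Series.between is inclusive on both ends by default.
conds = [percent.between(0, 5), percent.between(6, 15), percent.gt(15)]
choices = ["5", "15", ">15"]

# Rows matching no condition receive the default label (hypothetical name).
select_labels = np.select(conds, choices, default="unassigned").tolist()

# pd.cut uses contiguous right-inclusive bins, so 5.5 lands in (5, 15].
cut_labels = (
    pd.cut(percent, [0, 5, 15, float("inf")], labels=["0-5", "6-15", ">15"])
    .astype(str)
    .tolist()
)

print(select_labels)  # ['5', '5', 'unassigned', '>15', '5']
print(cut_labels)     # ['0-5', '0-5', '6-15', '>15', '0-5']
```

For integer-valued percentages, as in the question, the two labelings agree.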
Timings:

@mozway's solution:
    In [257]: def f2():
         ...:     (df.assign(Bucket=pd.cut(df['percent'], [0, 5, 15, float('inf')], labels=['0-5', '6-15', '>15']))
         ...:        .groupby('Bucket')
         ...:        .agg(**{'percent count': ('percent', 'count'), 'Avg. Amount': ('Amount', 'mean')})
         ...:        .fillna(0, downcast='infer')
         ...:        .reset_index())
         ...:
    In [258]: %timeit f2()
    8.02 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My solution:
    In [253]: def f1():
         ...:     conds = [df['percent'].between(0, 5), df['percent'].between(6, 15), df['percent'].gt(15)]
         ...:     choices = ['5', '15', '>15']
         ...:     df['Bucket'] = np.select(conds, choices)
         ...:     res = df.groupby('Bucket').agg({'percent': 'count', 'Amount': 'mean'}).reindex(choices).fillna(0).astype(int).reset_index()
         ...:
    In [254]: %timeit f1()
    3.64 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
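The timings above were taken on the four-row frame, where fixed overhead is likely to dominate. To check how the two approaches compare on your own data outside IPython, something like the following sketch with the standard `timeit` module may help. The frame size, the random data, and the function names `with_cut` and `with_select` are all assumptions for illustration, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed benchmark data: 10,000 random integer rows (not from the question).
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "percent": rng.integers(1, 101, size=n),
    "Amount": rng.integers(1, 1001, size=n),
})
choices = ["5", "15", ">15"]


def with_cut(frame):
    # pd.cut route: categorical buckets, empty ones kept via observed=False.
    buckets = pd.cut(frame["percent"], [0, 5, 15, float("inf")], labels=choices)
    return (
        frame.assign(Bucket=buckets)
        .groupby("Bucket", observed=False)
        .agg(n_rows=("percent", "count"), avg_amount=("Amount", "mean"))
    )


def with_select(frame):
    # np.select route: string labels, empty buckets restored via reindex.
    conds = [frame["percent"].between(0, 5),
             frame["percent"].between(6, 15),
             frame["percent"].gt(15)]
    bucket = np.select(conds, choices, default="")
    return (
        frame.groupby(bucket)
        .agg(n_rows=("percent", "count"), avg_amount=("Amount", "mean"))
        .reindex(choices)
    )


for fn in (with_cut, with_select):
    seconds = timeit.timeit(lambda: fn(df), number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Both functions return the same counts and means for integer percentages, so the comparison is like for like.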