Pandas/Python: build a frequency histogram over one column and aggregate another column into the same buckets

Question

I am using pandas with a data frame like this:

Name	percent	Amount
A	3	34
B	5	200
C	30	20
D	1	12

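I want to create buckets for the percent column, for example 0-5, 6-15 and >15. For each bucket I want the count of percent values that fall into it (effectively a histogram), and also the average of Amount over the same rows.

Using the example above, the expected result is: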

Bucket	percent count	Avg. Amount
5	3	82
15	0	0
>15	1	20

How can I achieve this in Python with pandas (or any other library)?

Answer 1

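You can use pandas.cut together with groupby + agg:

import pandas as pd

(df.assign(Bucket=pd.cut(df['percent'],
                         [0, 5, 15, float('inf')],   # bins (0, 5], (5, 15], (15, inf]
                         labels=['0-5', '6-15', '>15']))
   # observed=False keeps the empty '6-15' bucket in the result
   .groupby('Bucket', observed=False)
   .agg(**{'percent count': ('percent', 'count'),
           'Avg. Amount': ('Amount', 'mean')
          })
   # empty buckets have a NaN mean; downcast='infer' restores an integer dtype
   # (the downcast keyword is deprecated in recent pandas releases)
   .fillna(0, downcast='infer')
   .reset_index()
)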

Output:

  Bucket  percent count  Avg. Amount
0    0-5              3           82
1   6-15              0            0
2    >15              1           20
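For readers who want to run this end to end, here is a minimal self-contained sketch of the same pandas.cut idea. The frame is rebuilt from the question's sample data. The names `edges`, `labels`, `buckets` and `result` are illustrative and do not come from the answer. The sketch uses a plain `fillna(0)` instead of the `downcast` keyword, so the averages stay floats.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "percent": [3, 5, 30, 1],
    "Amount": [34, 200, 20, 12],
})

# pd.cut uses right-closed bins by default: (0, 5], (5, 15], (15, inf].
# A value of exactly 0 would fall outside the first bin unless
# include_lowest=True is passed.
edges = [0, 5, 15, float("inf")]
labels = ["0-5", "6-15", ">15"]
buckets = pd.cut(df["percent"], bins=edges, labels=labels)

result = (
    df.assign(Bucket=buckets)
      # observed=False keeps empty categories such as '6-15' in the output.
      .groupby("Bucket", observed=False)
      .agg(**{"percent count": ("percent", "count"),
              "Avg. Amount": ("Amount", "mean")})
      .fillna(0)   # the mean of an empty bucket is NaN
      .reset_index()
)
print(result)
```

Passing `observed=False` explicitly is a defensive choice: the rows for empty buckets only appear when unobserved categories are kept, so the sketch does not rely on the library default.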

Answer 2

Use numpy.select, Series.between and GroupBy.agg:

In [232]: import numpy as np

In [233]: conds = [df['percent'].between(0,5), df['percent'].between(6,15), df['percent'].gt(15)]

In [234]: choices = ['5', '15', '>15']

In [237]: df['Bucket'] = np.select(conds, choices)

In [245]: res = df.groupby('Bucket').agg({'percent': 'count', 'Amount': 'mean'}).reindex(choices).fillna(0).astype(int).reset_index()

In [246]: res
Out[246]: 
  Bucket  percent  Amount
0      5        3      82
1     15        0       0
2    >15        1      20
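The session above can be condensed into a standalone script, sketched below under two assumptions that are not part of the original answer. First, an explicit string `default` is passed to `np.select`, so that the default and the string choices share a dtype. Second, the fallback label "other" is a hypothetical name for rows that match no condition. Such rows can exist because `Series.between` is inclusive on both ends, so a non-integer value such as 5.5 matches none of the three conditions.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "percent": [3, 5, 30, 1],
    "Amount": [34, 200, 20, 12],
})

# Series.between is inclusive on both ends by default.
conds = [
    df["percent"].between(0, 5),
    df["percent"].between(6, 15),
    df["percent"].gt(15),
]
choices = ["5", "15", ">15"]

# Rows matching no condition get the hypothetical label "other".
df["Bucket"] = np.select(conds, choices, default="other")

res = (
    df.groupby("Bucket")
      .agg({"percent": "count", "Amount": "mean"})
      .reindex(choices)   # restore bucket order and add the empty '15' bucket
      .fillna(0)
      .astype(int)        # truncates fractional means; harmless for this sample
      .reset_index()
)
print(res)
```

Note that `reindex(choices)` also drops any "other" rows from the summary, which may or may not be what you want for real data.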

Timings:

@mozway's solution:

In [257]: def f2():
     ...:     (df.assign(Bucket=pd.cut(df['percent'], [0, 5, 15, float('inf')], labels=['0-5', '6-15', '>15']))
     ...:        .groupby('Bucket')
     ...:        .agg(**{'percent count': ('percent', 'count'), 'Avg. Amount': ('Amount', 'mean')})
     ...:        .fillna(0, downcast='infer')
     ...:        .reset_index())
     ...: 

In [258]: %timeit f2()
8.02 ms ± 587 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My solution:

In [253]: def f1():
     ...:     conds = [df['percent'].between(0,5), df['percent'].between(6,15), df['percent'].gt(15)]
     ...:     choices = ['5', '15', '>15']
     ...:     df['Bucket'] = np.select(conds, choices)
     ...:     res = df.groupby('Bucket').agg({'percent': 'count', 'Amount': 'mean'}).reindex(choices).fillna(0).astype(int).reset_index()
     ...: 

In [254]: %timeit f1()
3.64 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
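The timings above were taken on the four-row sample, where fixed per-call overhead probably dominates. The sketch below shows one way to repeat the comparison on a larger synthetic frame with the standard `timeit` module. The frame size, the random seed and the helper names `with_cut` and `with_select` are assumptions made for illustration. No timing figures are claimed here, since they depend on the machine and on the library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # hypothetical size, chosen only for illustration
big = pd.DataFrame({
    "percent": rng.integers(1, 101, size=n),    # whole numbers 1..100
    "Amount": rng.integers(1, 1_000, size=n),
})

def with_cut(df):
    # Bucket with pd.cut; rename so the key does not clash with the column.
    bucket = pd.cut(df["percent"], [0, 5, 15, float("inf")],
                    labels=["5", "15", ">15"]).rename("Bucket")
    return (df.groupby(bucket, observed=False)
              .agg(count=("percent", "count"), mean=("Amount", "mean")))

def with_select(df):
    # Bucket with np.select over inclusive ranges.
    conds = [df["percent"].between(0, 5),
             df["percent"].between(6, 15),
             df["percent"].gt(15)]
    choices = ["5", "15", ">15"]
    bucket = pd.Series(np.select(conds, choices, default="other"),
                       index=df.index, name="Bucket")
    return (df.groupby(bucket)
              .agg(count=("percent", "count"), mean=("Amount", "mean"))
              .reindex(choices))

for fn in (with_cut, with_select):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

The two helpers agree only because the synthetic percent values are whole numbers. With fractional values the inclusive `between` ranges leave gaps (for example between 5 and 6) that `pd.cut` does not.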
