Fastest, most efficient way to apply a groupby that references multiple columns

Question

Suppose we have the following dataset.

import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

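In Python we can easily run tmp.groupby('hi').agg(total_bye = ('bye', sum)) to get the sum of 'bye' for each group. But if I want to reference multiple columns, what is the fastest, most efficient, and cleanest (easy to read) way to write this in Python? In particular, can I do it with df.groupby(my_cols).agg()? What are the fastest alternatives? I am open to (and would actually prefer) using a library that is faster than pandas, such as dask or vaex.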

For example, in R's data.table we can do this easily, and it is very fast:


# In R, assume this object is a data.table
# In a single line, the code below groups by 'hi' and creates my_new_col: for each group, the sum of 'no' over the rows where bye > 5 and yes < 20.
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = 'hi']

# output 1
   hi my_new_col
1:  1          1
2:  2        116
3:  3          3
4:  5         95
5:  6          6

# Similarly, we can even group by a rule instead of creating a new col to group by. See below

tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = .(new_rule = ifelse(hi > 3, 1, 0))]

# output 2
   new_rule my_new_col
1:        0        120
2:        1        101

# We can even apply multiple aggregate functions in parallel using data.table
agg_fns <- function(x) list(sum=sum(as.double(x), na.rm=T),
                            mean=mean(as.double(x), na.rm=T),
                            min=min(as.double(x), na.rm=T),
                            max=max(as.double(x), na.rm=T))

tmp[,
    unlist(
        list(N = .N, # add a N column (row count) to the summary
            unlist(mclapply(.SD, agg_fns, mc.cores = 12), recursive = F)), # apply all agg_fns over all .SDcols
    recursive = F),
    .SDcols = !unique(c(names('hi'), as.character(unlist('hi'))))]

# output 3
   N bye.sum bye.mean bye.min bye.max yes.sum yes.mean yes.min yes.max no.sum  no.mean no.min
1: 11     340 30.90909      12      62     140 12.72727       2      32    381 34.63636      1
   no.max maybe.sum maybe.mean maybe.min maybe.max
1:     95       393   35.72727         1        95

Do we have the same flexibility in Python?

Answer 1

You can call agg on all the columns you need and add a prefix:

tmp.groupby('hi').agg('sum').add_prefix('total_')

Output:

    total_bye  total_yes  total_no  total_maybe
hi                                             
1          24         33         2           92
2          67          6       116            6
3         134         90       162          131
5          53          5        95           95
6          62          6         6           69
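
If only some of the columns are needed, one option is to select them before aggregating. The snippet below is a minimal sketch of that idea on the question's data; the choice of 'bye' and 'no' and the variable name totals are only for illustration.

```python
import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

# Restrict the groupby to the columns of interest, then sum and prefix.
# 'bye' and 'no' are an arbitrary illustrative selection.
totals = tmp.groupby('hi')[['bye', 'no']].sum().add_prefix('total_')
print(totals)
```

The values should match the total_bye and total_no columns in the output above.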

You can even use a dictionary to combine columns and operations flexibly:

tmp.groupby('hi').agg(**{'%s_%s' % (label,c):  (c, op)
                         for c in tmp.columns
                         for (label,op) in [('total', 'sum'), ('average', 'mean')]
                        })

Output:

    total_hi  average_hi  total_bye  average_bye  total_yes  average_yes  total_no  average_no  total_maybe  average_maybe
hi                                                                                                                        
1          2           1         24    12.000000         33         16.5         2    1.000000           92          46.00
2          6           2         67    22.333333          6          2.0       116   38.666667            6           2.00
3         12           3        134    33.500000         90         22.5       162   40.500000          131          32.75
5          5           5         53    53.000000          5          5.0        95   95.000000           95          95.00
6          6           6         62    62.000000          6          6.0         6    6.000000           69          69.00
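
The snippets above aggregate each column on its own, so they do not cover the conditional sum from the question, where the aggregate of 'no' depends on 'bye' and 'yes'. One possible pandas translation is sketched below. It is an assumption about how to express the data.table calls, not a benchmarked "fastest" solution. The names masked_no, by_hi, new_rule and by_rule are illustrative.

```python
import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

# Vectorised counterpart of ifelse(bye > 5 & yes < 20, no, 0):
# keep 'no' where both conditions hold, otherwise use 0.
masked_no = tmp['no'].where((tmp['bye'] > 5) & (tmp['yes'] < 20), 0)

# Group by an existing column (compare with "output 1" in the question).
by_hi = masked_no.groupby(tmp['hi']).sum().rename('my_new_col')

# Group by a rule computed on the fly, without adding a column to tmp
# (compare with "output 2" in the question).
new_rule = (tmp['hi'] > 3).astype(int).rename('new_rule')
by_rule = masked_no.groupby(new_rule).sum().rename('my_new_col')

print(by_hi)
print(by_rule)
```

Building the mask once with vectorised column operations and then calling a built-in aggregation avoids a Python-level function call per group, which groupby(...).apply with a lambda would incur.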
