Fastest, most efficient way to apply a groupby that references multiple columns
Suppose we have a dataset.
import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})
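In Python we can easily do tmp.groupby('hi').agg(total_bye = ('bye', 'sum'))
to get the sum of 'bye' for each group. However, if I want to reference multiple columns, what is the fastest, most efficient, and cleanest (easiest to read) way to write this in Python? In particular, can I do this with df.groupby(my_cols).agg()? What are the fastest alternatives? I am open to (and would actually prefer) using libraries that are faster than pandas, such as dask or vaex.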
For example, in R data.table we can do this easily, and it is very fast:
# In R, assume this object is a data.table
# In a single line, the code below groups by 'hi' and creates my_new_col: for each group, the sum of 'no' over the rows where bye > 5 and yes < 20.
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = 'hi']
# output 1
   hi my_new_col
1:  1          1
2:  2        116
3:  3          3
4:  5         95
5:  6          6
# Similarly, we can even group by a rule instead of creating a new col to group by. See below
tmp[, .(my_new_col = sum(ifelse(bye > 5 & yes < 20, no, 0))), by = .(new_rule = ifelse(hi > 3, 1, 0))]
# output 2
   new_rule my_new_col
1:        0        120
2:        1        101
# We can even apply multiple aggregate functions in parallel using data.table
library(parallel)  # provides mclapply
agg_fns <- function(x) list(sum  = sum(as.double(x), na.rm = TRUE),
                            mean = mean(as.double(x), na.rm = TRUE),
                            min  = min(as.double(x), na.rm = TRUE),
                            max  = max(as.double(x), na.rm = TRUE))
tmp[,
    unlist(
      list(N = .N,  # add a N column (row count) to the summary
           unlist(mclapply(.SD, agg_fns, mc.cores = 12), recursive = FALSE)),  # apply all agg_fns over all .SDcols
      recursive = FALSE),
    .SDcols = !unique(c(names('hi'), as.character(unlist('hi'))))]
# output 3
    N bye.sum bye.mean bye.min bye.max yes.sum yes.mean yes.min yes.max no.sum  no.mean no.min
1: 11     340 30.90909      12      62     140 12.72727       2      32    381 34.63636      1
   no.max maybe.sum maybe.mean maybe.min maybe.max
1:     95       393   35.72727         1        95
Do we have the same flexibility in Python?
You can use agg on all the columns you need and add a prefix:
tmp.groupby('hi').agg('sum').add_prefix('total_')
Output:
    total_bye  total_yes  total_no  total_maybe
hi
1          24         33         2           92
2          67          6       116            6
3         134         90       162          131
5          53          5        95           95
6          62          6         6           69
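agg also accepts a list of function names, which is one possible way to approximate the third R output from the question (several summary statistics for every non-grouping column). The sketch below is an assumption about what is wanted, not a benchmarked solution: the names fns, summary, flat and per_group are made up for illustration, and the parallelism of the R mclapply call is not reproduced.

```python
import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

fns = ['sum', 'mean', 'min', 'max']  # illustrative choice, same four as in the R code

# Ungrouped summary: one row per function, one column per original column.
summary = tmp.drop(columns='hi').agg(fns)

# Flatten into a single record with names such as 'bye.sum', as in the R output.
flat = {'N': len(tmp)}
for col in summary.columns:
    for fn in summary.index:
        flat['%s.%s' % (col, fn)] = summary.loc[fn, col]

# The same list of functions applied per group; the result has
# (column, function) MultiIndex columns, flattened here to 'bye.sum' etc.
per_group = tmp.groupby('hi').agg(fns)
per_group.columns = ['%s.%s' % (col, fn) for col, fn in per_group.columns]

print(pd.Series(flat))
print(per_group)
```

The ungrouped part mirrors the R call, which has no by argument; the per_group part shows the same idea combined with groupby.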
You can even use a dictionary to flexibly combine columns and operations:
tmp.groupby('hi').agg(**{'%s_%s' % (label, c): (c, op)
                         for c in tmp.columns
                         for (label, op) in [('total', 'sum'), ('average', 'mean')]
                         })
Output:
    total_hi  average_hi  total_bye  average_bye  total_yes  average_yes  total_no  average_no  total_maybe  average_maybe
hi
1          2           1         24    12.000000         33         16.5         2    1.000000           92          46.00
2          6           2         67    22.333333          6          2.0       116   38.666667            6           2.00
3         12           3        134    33.500000         90         22.5       162   40.500000          131          32.75
5          5           5         53    53.000000          5          5.0        95   95.000000           95          95.00
6          6           6         62    62.000000          6          6.0         6    6.000000           69          69.00
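For the conditional sums in the question (the first two R outputs), one possible pandas approach is to zero out the rows that fail the condition first and then run a plain vectorised groupby sum. This is only a sketch: cond, masked, new_rule, out1 and out2 are illustrative names, and no speed comparison against data.table, dask or vaex is implied.

```python
import pandas as pd

tmp = pd.DataFrame({'hi': [1,2,3,3,5,6,3,2,3,2,1],
                    'bye': [12,23,35,35,53,62,31,22,33,22,12],
                    'yes': [12,2,32,3,5,6,23,2,32,2,21],
                    'no': [1,92,93,3,95,6,33,2,33,22,1],
                    'maybe': [91,2,32,3,95,69,3,2,93,2,1]})

# Keep 'no' where the condition holds, otherwise 0,
# which plays the role of ifelse(bye > 5 & yes < 20, no, 0) in the R code.
cond = (tmp['bye'] > 5) & (tmp['yes'] < 20)
masked = tmp['no'].where(cond, 0)

# R output 1: group by the existing 'hi' column.
out1 = masked.groupby(tmp['hi']).sum().rename('my_new_col').reset_index()

# R output 2: group by a rule computed on the fly, without adding a column to tmp.
new_rule = (tmp['hi'] > 3).astype(int).rename('new_rule')
out2 = masked.groupby(new_rule).sum().rename('my_new_col').reset_index()

print(out1)
print(out2)
```

Masking before grouping keeps the aggregation itself a built-in sum, which avoids calling a Python function once per group as groupby(...).apply with a lambda would.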