Most efficient way to create and query a grouped summary in R, and its Python equivalent

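Suppose you have an array of numeric values, and a corresponding array of integers indicating the group each value belongs to.

You want to get the mean of the values by group, and identify which group has the highest mean.
[Note: we may need 'median' or some other function instead of 'mean' for the grouped summary.]

This is what I could come up with in R: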

g <- c(7,1,0,2,1,1,7,4,4,1)
v <- c(0.35,0.2,0.45,0.5,0.43,0.57,0.62,0.11,0.23,0.72)

# Goal:
# - get the mean of the values in v grouped by the values in g
# - report the g value for which the grouped mean is maximal

# Using 'by'
b <- by(v, list(g), FUN = mean)
gbest <- dimnames(b)[[1]][which.max(b)]
print(gbest)

# Using 'aggregate'
a <- aggregate(v ~ g, FUN = mean)
gbest <- a[which.max(a$v),"g"]
print(gbest)

# Speed test
set.seed(1234)

by_time <- system.time({
  replicate(1000, {
    v <- runif(10, 0, 1)
    b <- by(v, list(g), FUN = mean)
    gbest <- dimnames(b)[[1]][which.max(b)]
  })
})
print(by_time)

aggregate_time <- system.time({
  replicate(1000, {
    v <- runif(10, 0, 1)
    a <- aggregate(v ~ g, FUN = mean)
    gbest <- a[which.max(a$v),"g"]
  })
})
print(aggregate_time)

On my PC, the aggregate method is about 2.5 times slower than the by method.

Do you think this is an efficient way of doing this? Or can you suggest better alternatives?

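Then, very importantly, I need to find a way to do this in Python.
The only way I have found so far is via pandas, using groupby.

The thing is, this needs to be as efficient as possible, because it is used inside a loop.
At each iteration, the loop computes a new v (hence the use of runif in my code above).
In this example I kept g constant; in the real application it sometimes is constant, while in other cases it is updated at each iteration by appending one integer to it, and v of course also grows in length by 1.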

Any suggestions/pointers for a python implementation of this computation?
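One possible direction, sketched here under the assumption that NumPy is acceptable and that only the mean is needed: encode the group labels as dense integer codes with np.unique, then get per-group sums with np.bincount. The helper names encode_groups and best_group are made up for this illustration. When g stays constant, the encoding can be done once outside the loop, so each iteration only pays for one bincount and one argmax.

```python
import numpy as np


def encode_groups(g):
    """Map arbitrary integer group labels to dense codes 0..k-1.

    Returns the sorted unique labels, the code of each element, and the
    number of elements per group.
    """
    labels, codes = np.unique(np.asarray(g), return_inverse=True)
    counts = np.bincount(codes)
    return labels, codes, counts


def best_group(labels, codes, counts, v):
    """Return the label whose group mean of v is largest (first one on ties)."""
    sums = np.bincount(codes, weights=np.asarray(v, dtype=float),
                       minlength=len(labels))
    means = sums / counts
    return labels[np.argmax(means)]


g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]

# Encode once while g is unchanged; redo it whenever g gets a new element.
labels, codes, counts = encode_groups(g)
print(best_group(labels, codes, counts, v))  # group 2 has the largest mean
```

This trick relies on the mean being a sum divided by a count, so it does not carry over directly to 'median' or other summaries; those would need a different approach (for example sorting by group code first).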


EDIT: using tapply, as suggested by user Parfait (adding names to get the actual desired result).

tapply_time <- system.time({
  replicate(1000, {
    v <- runif(10, 0, 1)
    grp_means <- tapply(v, g, mean)
    gbest <- names(grp_means)[which(grp_means == max(grp_means))]
  })
})

print(tapply_time)

On my PC, this is 2.5-2.8 times faster than by, so definitely preferable.


EDIT: testing the Python methods, as suggested by users Nikita Almakov and StupidWolf.

import numpy as np
import pandas as pd
import time

from convtools import conversion as c

# this writes the necessary code and compiles the function (so do it outside
# the loop)
converter = (
    # here we group by first item of each tuple
    c.group_by(c.item(0))
    .aggregate({
        # here we can store & calculate whatever we want,
        # using fields in group by and any combination of reducers,
        # including custom reduce funcs
        "g": c.item(0),
        # there's a handful of ReduceFuncs -> https://convtools.readthedocs.io/en/latest/cheatsheet.html#reduce-funcs-list
        "v_avg": c.ReduceFuncs.Average(c.item(1))
    })
    .pipe(c.aggregate(c.ReduceFuncs.MaxRow(c.item("v_avg"))))
    .gen_converter(debug=True)  # if black is installed, this will print formatted code
)

g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]

converter(zip(g, v))['g']
# 2

v = np.random.uniform(0,1,10); converter(zip(g, v))['g']
# 0

%timeit v = np.random.uniform(0,1,10); converter(zip(g, v))['g']
# 22.6 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

pd.Series(v).groupby(g).agg('mean').sort_values(ascending=False).index[0]
# 2

%timeit v = np.random.uniform(0,1,10); pd.Series(v).groupby(g).agg('mean').sort_values(ascending=False).index[0]
# 817 µs ± 3.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
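As a side note, the pandas line above sorts all the group means just to read off the first index. A slightly more direct variant, sketched below, asks for the index of the maximum with idxmax and takes the aggregation function as a parameter, so 'median' can be swapped in for 'mean'. The wrapper name best_group_pandas is made up for this illustration, and I have not timed it, so no speed claim is intended.

```python
import pandas as pd

g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]


def best_group_pandas(g, v, func="mean"):
    """Label of the group whose aggregated value is largest.

    func is passed straight to GroupBy.agg, so "mean", "median", or a
    callable all work. On ties, idxmax returns the first label.
    """
    return pd.Series(v).groupby(g).agg(func).idxmax()


print(best_group_pandas(g, v))            # grouped mean
print(best_group_pandas(g, v, "median"))  # grouped median
```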

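The fastest method in R (tapply) takes 0.2 s for 1000 repetitions, so about 0.2 µs per loop, if I am not mistaken.
EDIT: yes, I was mistaken! It is 0.2 ms, i.e. 200 µs per loop. Thanks to Nikita for pointing this out!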
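To keep the units straight when comparing R's system.time totals with per-loop figures from %timeit, the conversion can be written out explicitly. The sketch below also shows how to time a snippet with the standard timeit module outside IPython; the function names and the stand-in workload are made up for this illustration.

```python
import timeit


def per_loop_microseconds(total_seconds, n_loops):
    """Convert a total wall time for n_loops repetitions into µs per loop."""
    return total_seconds / n_loops * 1e6


# The R figure quoted above: 0.2 s in total for 1000 repetitions,
# which works out to roughly 200 µs per loop.
print(per_loop_microseconds(0.2, 1000))


def workload():
    """Hypothetical stand-in for one iteration of the real computation."""
    return sum(range(100))


n = 1000
total = timeit.timeit(workload, number=n)  # total seconds for n calls
print(f"{per_loop_microseconds(total, n):.2f} µs per loop")
```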

Conclusion:

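You can have a look at the convtools Python library, which lets you define conversions; once you are done, it writes and compiles ad hoc Python code under the hood, so you end up with a function that does exactly what you want.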

# pip install convtools
from convtools import conversion as c

# this writes the necessary code and compiles the function (so do it outside
# the loop)
converter = (
    # here we group by first item of each tuple
    c.group_by(c.item(0))
    .aggregate({
        # here we can store & calculate whatever we want,
        # using fields in group by and any combination of reducers,
        # including custom reduce funcs
        "g": c.item(0),
        # there's a handful of ReduceFuncs -> https://convtools.readthedocs.io/en/latest/cheatsheet.html#reduce-funcs-list
        "v_avg": c.ReduceFuncs.Average(c.item(1))
    })
    .pipe(c.aggregate(c.ReduceFuncs.MaxRow(c.item("v_avg"))))
    .gen_converter(debug=True)  # if black is installed, this will print formatted code
)

g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]

# passing iterable of tuples (g_item, v_item)
result = converter(zip(g, v))
print(result)
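For comparison, here is roughly what such a single-pass group-and-reduce looks like when written by hand in plain Python. This is only an illustrative sketch, not the code convtools actually generates; the function name best_group_by_mean is made up, and the output keys "g" and "v_avg" simply mirror the ones used in the converter above.

```python
from collections import defaultdict


def best_group_by_mean(pairs):
    """Return {"g": label, "v_avg": mean} for the group with the largest mean.

    pairs is an iterable of (group, value) tuples. Returns None if it is
    empty. On ties, the group seen first wins.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in pairs:
        sums[group] += value
        counts[group] += 1
    if not sums:
        return None
    best = max(sums, key=lambda grp: sums[grp] / counts[grp])
    return {"g": best, "v_avg": sums[best] / counts[best]}


g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]

result = best_group_by_mean(zip(g, v))
print(result)  # {'g': 2, 'v_avg': 0.5}
```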

Let me know if you have any questions - I'd be happy to help!