Most efficient way to create and query a grouped summary in R, and its Python equivalent
Suppose you have an array of numerical values, and a corresponding array of integers indicating the group each value belongs to.
You want to get the mean of the values by group, and to determine which group has the highest mean.
[Note: we may need 'median' or some other function instead of 'mean' for the grouped summary.]
Here is what I could come up with in R:
g <- c(7,1,0,2,1,1,7,4,4,1)
v <- c(0.35,0.2,0.45,0.5,0.43,0.57,0.62,0.11,0.23,0.72)
# Goal:
# - get the mean of the values in v grouped by the values in g
# - report the g value for which the grouped mean is maximal
# Using 'by'
b <- by(v, list(g), FUN = mean)
gbest <- dimnames(b)[[1]][which.max(b)]
print(gbest)
# Using 'aggregate'
a <- aggregate(v ~ g, FUN = mean)
gbest <- a[which.max(a$v),"g"]
print(gbest)
# Speed test
set.seed(1234)
by_time <- system.time({
replicate(1000, {
v <- runif(10, 0, 1)
b <- by(v, list(g), FUN = mean)
gbest <- dimnames(b)[[1]][which.max(b)]
})
})
print(by_time)
aggregate_time <- system.time({
replicate(1000, {
v <- runif(10, 0, 1)
a <- aggregate(v ~ g, FUN = mean)
gbest <- a[which.max(a$v),"g"]
})
})
print(aggregate_time)
On my PC, the aggregate method is about 2.5 times slower than the by method.
Do you think this is an efficient way of doing this? Or can you suggest better alternatives?
Then, very importantly, I need to find a way to do this in Python.
The only way I have found so far is through pandas, using groupby.
The thing is, this needs to be as efficient as possible, because it is used inside a loop.
At each iteration, the loop computes a new v (hence the use of runif in my code above).
In this example I kept g constant; in the real application it sometimes is, whereas in other cases it is updated at each iteration by appending one integer to it, with v of course also growing in length by 1.
Any suggestions/pointers for a python implementation of this computation?
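One idea worth noting, given that g and v only grow by one element per iteration in the real application: the grouped means can be updated incrementally instead of being recomputed from scratch. A minimal Python sketch of that idea (the dict-based state and the helper name add_value are illustrative, not from the original code; this works for the mean, whose sums and counts compose, but not for the median, which needs all per-group values):

```python
from collections import defaultdict

# Running per-group sums and counts (illustrative names)
running_sum = defaultdict(float)
running_count = defaultdict(int)

def add_value(group, value):
    """Fold one new (group, value) pair into the running state and
    return the group whose running mean is currently maximal."""
    running_sum[group] += value
    running_count[group] += 1
    return max(running_sum, key=lambda k: running_sum[k] / running_count[k])

# Feed the example pairs one at a time, as the loop would
g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]
for gi, vi in zip(g, v):
    gbest = add_value(gi, vi)
print(gbest)  # group 2 has the highest mean (0.5)
```

Each update then costs O(number of groups) for the max step, independent of the length of v.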
EDIT using tapply, as suggested by user Parfait (adding names to get the actually desired result).
tapply_time <- system.time({
replicate(1000, {
v <- runif(10, 0, 1)
grp_means <- tapply(v, g, mean)
gbest <- names(which(grp_means == max(grp_means)))
})
})
print(tapply_time)
On my PC this is 2.5-2.8 times faster than by, so it is definitely preferable.
EDIT testing the python approaches, as suggested by users Nikita Almakov and StupidWolf
import numpy as np
import pandas as pd
import time
from convtools import conversion as c
# this writes the necessary code and compiles the function (so do it outside
# the loop)
converter = (
# here we group by first item of each tuple
c.group_by(c.item(0))
.aggregate({
# here we can store & calculate whatever we want,
# using fields in group by and any combination of reducers,
# including custom reduce funcs
"g": c.item(0),
# there's a handful of ReduceFuncs -> https://convtools.readthedocs.io/en/latest/cheatsheet.html#reduce-funcs-list
"v_avg": c.ReduceFuncs.Average(c.item(1))
})
.pipe(c.aggregate(c.ReduceFuncs.MaxRow(c.item("v_avg"))))
.gen_converter(debug=True) # if black is installed, this will print formatted code
)
g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.32, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]
converter(zip(g, v))['g']
# 2
v = np.random.uniform(0,1,10); converter(zip(g, v))['g']
# 0
%timeit v = np.random.uniform(0,1,10); converter(zip(g, v))['g']
# 22.6 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
pd.Series(v).groupby(g).agg('mean').sort_values(ascending=False).index[0]
# 2
%timeit v = np.random.uniform(0,1,10); pd.Series(v).groupby(g).agg('mean').sort_values(ascending=False).index[0]
# 817 µs ± 3.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The fastest method in R (tapply) takes 0.2 s for 1000 replications, so about 0.2 µs per loop, if I'm not mistaken.
EDIT: yes, I was mistaken! It is 0.2 ms, i.e. 200 µs per loop. Thanks to Nikita for pointing that out!
Conclusions:
- this computation can be implemented in python without creating a pandas dataframe at each iteration
- the best way to do it at the moment seems to be convtools
As per the edit above, convtools is about 10 times faster than R tapply; pandas is about 4 times slower than R tapply (by the way, I checked: the numpy generation of the random uniform v has hardly any (relative) impact on the total time, about 3.5 µs per loop).
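For completeness, this grouped-mean-plus-argmax can also be done in plain numpy via np.bincount, avoiding both pandas and per-element Python loops; a sketch assuming, as in the example, that the group labels are small non-negative integers:

```python
import numpy as np

g = np.array([7, 1, 0, 2, 1, 1, 7, 4, 4, 1])
v = np.array([0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72])

counts = np.bincount(g)            # observations per group label
sums = np.bincount(g, weights=v)   # per-group sums of v
with np.errstate(invalid="ignore"):
    means = sums / counts          # labels absent from g yield nan
gbest = np.nanargmax(means)        # label with the highest mean
print(gbest)  # 2
```

Labels that never occur in g get a count of 0 and a nan mean, which np.nanargmax skips.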
You may want to check out the convtools python library, which allows you to define conversions; once you are done, it writes and compiles the necessary ad hoc python code under the hood, so you get a function that does what you wanted.
# pip install convtools
from convtools import conversion as c
# this writes the necessary code and compiles the function (so do it outside
# the loop)
converter = (
# here we group by first item of each tuple
c.group_by(c.item(0))
.aggregate({
# here we can store & calculate whatever we want,
# using fields in group by and any combination of reducers,
# including custom reduce funcs
"g": c.item(0),
# there's a handful of ReduceFuncs -> https://convtools.readthedocs.io/en/latest/cheatsheet.html#reduce-funcs-list
"v_avg": c.ReduceFuncs.Average(c.item(1))
})
.pipe(c.aggregate(c.ReduceFuncs.MaxRow(c.item("v_avg"))))
.gen_converter(debug=True) # if black is installed, this will print formatted code
)
g = [7, 1, 0, 2, 1, 1, 7, 4, 4, 1]
v = [0.35, 0.2, 0.45, 0.5, 0.43, 0.57, 0.62, 0.11, 0.23, 0.72]
# passing iterable of tuples (g_item, v_item)
result = converter(zip(g, v))
print(result)
Let me know if you have any questions - I'd be happy to help!