How to calculate mean of all columns, by group?
I need to get the mean of all columns of a large data set using R, grouped by 2 variables.
Let's try it with mtcars:
library(dplyr)
g_mtcars <- group_by(mtcars, cyl, gear)
summarise(g_mtcars, mean(hp))
# Source: local data frame [8 x 3]
# Groups: cyl [?]
#
# cyl gear `mean(hp)`
# <dbl> <dbl> <dbl>
# 1 4 3 97.0000
# 2 4 4 76.0000
# 3 4 5 102.0000
# 4 6 3 107.5000
# 5 6 4 116.5000
# 6 6 5 175.0000
# 7 8 3 194.1667
# 8 8 5 299.5000
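It works for "hp", but I need the mean of every other column of mtcars (except "cyl" and "gear", which define the groups).
The data set is large, with several columns, so typing each one by hand, as in summarise(g_mtcars, mean(hp), mean(drat), mean(wt), ...), is not practical.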
Use data.table. (You can't call setDT(mtcars) directly, because the binding is locked. Copy the data to a different name, such as mt_dt, and try:)
library(data.table)
mt_dt = as.data.table(mtcars)
mt_dt[ , lapply(.SD, mean) , by=c("cyl", "gear")]
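If only some columns should be averaged, or the data may contain missing values, the same data.table idiom can be narrowed with .SDcols. The following is a minimal sketch, not part of the original answer: the column selection in value_cols is an arbitrary choice made for illustration, and na.rm = TRUE is an assumption about how missing values should be handled.

```r
library(data.table)

# as.data.table() makes a copy, so the locked mtcars binding is left untouched.
mt_dt <- as.data.table(mtcars)

# Hypothetical selection: the columns to average (chosen for illustration only).
value_cols <- c("mpg", "hp", "wt")

# .SD holds only the columns named in .SDcols; keyby groups the rows and
# sorts the result by cyl, then gear.
dt_means <- mt_dt[, lapply(.SD, mean, na.rm = TRUE),
                  keyby = .(cyl, gear),
                  .SDcols = value_cols]

print(dt_means)
```

Using keyby instead of by is a design choice: by keeps the groups in order of first appearance, while keyby returns them sorted, which makes the result easier to compare with the dplyr output.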
You can use multiple mean statements in dplyr::summarize, like this:
library(dplyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_hp = mean(hp), mean_wt = mean(wt))
# Source: local data frame [8 x 4]
# Groups: cyl [?]
# cyl gear mean_hp mean_wt
# <dbl> <dbl> <dbl> <dbl>
# 1 4 3 97.0000 2.465000
# 2 4 4 76.0000 2.378125
# 3 4 5 102.0000 1.826500
# 4 6 3 107.5000 3.337500
# 5 6 4 116.5000 3.093750
# 6 6 5 175.0000 2.770000
# 7 8 3 194.1667 4.104083
# 8 8 5 299.5000 3.370000
Edit2: recent versions of dplyr suggest using the regular summarise with the across function, as in:
library(dplyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarise(across(everything(), mean))
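On a real data set the across() call often needs a little more care than with mtcars, for example when some columns are not numeric or contain NA. The sketch below is an illustration under those assumptions, not part of the original answer: it restricts the summary to numeric columns, passes na.rm = TRUE, and prefixes the result names with "mean_" (the prefix is an arbitrary choice). It assumes dplyr 1.0.0 or later.

```r
library(dplyr)

means_by_group <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(
    # Grouping columns (cyl, gear) are excluded from across() automatically.
    across(
      where(is.numeric),
      function(x) mean(x, na.rm = TRUE),
      .names = "mean_{.col}"   # e.g. hp becomes mean_hp
    ),
    .groups = "drop"           # return an ungrouped tibble
  )

print(means_by_group)
```

Setting .groups = "drop" avoids the message about the remaining grouping and returns a plain tibble, which tends to be what later steps expect.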
What you're looking for is ?summarise_all or ?summarise_each from dplyr.
Edit: full code:
library(dplyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarise_all("mean")
# Source: local data frame [8 x 11]
# Groups: cyl [?]
#
# cyl gear mpg disp hp drat wt qsec vs am carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 3 21.500 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 1.000000
# 2 4 4 26.925 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 1.500000
# 3 4 5 28.200 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 2.000000
# 4 6 3 19.750 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 1.000000
# 5 6 4 19.750 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4.000000
# 6 6 5 19.700 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 6.000000
# 7 8 3 15.050 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3.083333
# 8 8 5 15.400 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 6.000000
For completeness, you can use the plyr package and do this:
library(plyr)
ddply(mtcars, c('cyl', 'gear'), summarize, mean_hp = mean(hp))
aggregate is the simplest way in base R:
aggregate(. ~ cyl + gear, data = mtcars, FUN = mean)
# cyl gear mpg disp hp drat wt qsec vs am carb
# 1 4 3 21.500 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 1.000000
# 2 6 3 19.750 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 1.000000
# 3 8 3 15.050 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3.083333
# 4 4 4 26.925 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 1.500000
# 5 6 4 19.750 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4.000000
# 6 4 5 28.200 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 2.000000
# 7 6 5 19.700 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 6.000000
# 8 8 5 15.400 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 6.000000
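One thing to keep in mind with the formula interface is that its default na.action is na.omit, so a row with an NA in any column is dropped before the means are computed. mtcars has no missing values, so it makes no difference here. As a sketch of an alternative that handles NA per column, the data-frame interface of aggregate can be used with na.rm = TRUE; the helper names group_cols and value_cols are introduced only for this illustration.

```r
# Columns that define the groups, and the remaining columns to average.
group_cols <- c("cyl", "gear")
value_cols <- setdiff(names(mtcars), group_cols)

# Data-frame interface: the extra argument na.rm = TRUE is passed on to mean(),
# so missing values are skipped column by column instead of dropping whole rows.
base_means <- aggregate(mtcars[value_cols],
                        by = mtcars[group_cols],
                        FUN = mean,
                        na.rm = TRUE)

print(base_means)
```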