Mean and sd of multiple variables with multiple grouping factors in R
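I have been searching for an answer but have not found a solution yet; I am still new to R.
My data frame holds measurements of one ecological trait (relative soil cover) for 70 plant species under different conditions: different years, different chemical treatments, and presence/absence of a greenhouse.
I need to summarize the data into a new data frame that shows the mean and standard deviation of the trait for each species and each combination of factors (conditions). I know aggregate or lapply could help, but I am struggling to combine grouping by 3 different factors with many species, which implies the need for "automated" code.
My apologies if I missed a post that already answers my question.
Thank you for your patience and help.
Edit: here is a reproducible example, I hope I did it right: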
mydata<-structure(list(Year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L,
2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L,
2011L), Replicate = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L), Treatment = structure(c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), Greenhouse = structure(c(2L, 2L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor"), Sp_1 = c(4L, 0L, 2L, 5L, 4L, 0L, 2L,
5L, 0L, 0L, 4L, 6L, 4L, 0L, 2L, 5L), Sp_2 = c(7L, 0L, 1L, 1L,
7L, 0L, 1L, 1L, 7L, 0L, 1L, 1L, 6L, 0L, 1L, 1L), Sp_3 = c(8L,
2L, 2L, 1L, 8L, 2L, 2L, 1L, 10L, 2L, 1L, 1L, 4L, 2L, 2L, 1L)), class = "data.frame", row.names = c(NA,
-16L))
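I only put 3 species in this example, but as I said I have more than 70, so I need something that can select all the species columns (mydata[,5:75]? something along those lines) rather than c("sp_1","sp_2",..., "sp_70").
I would like the output to look like this: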
Year Treatment Greenhouse Sp_1_mean Sp_1_sd Sp_2_mean Sp_2_sd
2010 A Yes x x x x
2010 A No x x x x
2010 B Yes x x x x
2010 B No x x x x
2011 A Yes x x x x
2011 A No x x x x
2011 B Yes x x x x
2011 B No x x x x
Here is a dput() showing what the desired output looks like:
desired_output<-structure(list(Year = c(2010L, 2010L, 2010L, 2010L, 2011L, 2011L,
2011L, 2011L), Treatment = structure(c(1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L), .Label = c("A", "B"), class = "factor"), Greenhouse = structure(c(2L,
1L, 2L, 1L, 2L, 1L, 2L, 1L), .Label = c("No", "Yes"), class = "factor"),
Sp_1_mean = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"),
Sp_1_sd = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"),
Sp_2_mean = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"),
Sp_2_sd = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"),
Sp_3_mean = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor"),
Sp_3_sd = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "x", class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
I hope this is clearer! Thanks
With data.table, you can do something like this:
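library(data.table)
setDT(df)  # convert df to a data.table by reference
# for each (col1, col2) group, return the mean and sd of every column in .SDcols
df[, lapply(.SD, function(x) c(mean(x), sd(x))),
   by = c("col1","col2"),
   .SDcols = c("x1","x2")]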
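(Without a reproducible example it is hard to give you more precise syntax.)
This means: by group (col1 and col2), apply the mean and standard deviation to each column of the data subset .SD (here, columns x1 and x2).
Example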
library(data.table)
df <- as.data.table(mtcars)
output <- df[, lapply(.SD, function(x) return(c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)))),
.SDcols = c("disp","drat"),
by = c("cyl","gear")]
output[, 'stat' := c("mean","sd"), by = c("cyl","gear")]
output
cyl gear disp drat stat
1: 6 4 163.800000 3.91000000 mean
2: 6 4 4.387862 0.01154701 sd
3: 4 4 102.625000 4.11000000 mean
4: 4 4 30.742699 0.37156042 sd
5: 6 3 241.500000 2.92000000 mean
6: 6 3 23.334524 0.22627417 sd
7: 8 3 357.616667 3.12083333 mean
8: 8 3 71.823494 0.23027487 sd
9: 4 3 120.100000 3.70000000 mean
10: 4 3 NA NA sd
11: 4 5 107.700000 4.10000000 mean
12: 4 5 17.819091 0.46669048 sd
13: 8 5 326.000000 3.88000000 mean
14: 8 5 35.355339 0.48083261 sd
15: 6 5 145.000000 3.62000000 mean
16: 6 5 NA NA sd
Here, I added a stat column to identify which statistic each row holds.
Edit using the reproducible example
setDT(mydata)
output <- mydata[, lapply(.SD, function(x) return(c(mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)))),
.SDcols = c("Sp_1", "Sp_2", "Sp_3"),
by = c("Year", "Treatment", "Greenhouse")
]
output[, 'stat' := c('mean','sd') ,
by = c("Year", "Treatment", "Greenhouse")]
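The question asks for a way to pick all 70+ species columns without typing each name. One possible sketch, assuming every species column starts with the prefix "Sp_" as in the example, is to build the column vector with grep() and pass it to .SDcols. The toy table below is a four-row subset of the question's data, and the names toy, sp_cols and long are made up for illustration.

```r
library(data.table)

# Four rows taken from the question's example (Treatment A, Greenhouse Yes)
toy <- data.table(
  Year       = c(2010L, 2010L, 2011L, 2011L),
  Replicate  = c(1L, 2L, 1L, 2L),
  Treatment  = factor(c("A", "A", "A", "A")),
  Greenhouse = factor(c("Yes", "Yes", "Yes", "Yes")),
  Sp_1       = c(4L, 4L, 0L, 4L),
  Sp_2       = c(7L, 7L, 7L, 6L)
)

# Assumption: species columns are exactly those whose names start with "Sp_"
sp_cols <- grep("^Sp_", names(toy), value = TRUE)

# Same aggregation as above, but with the column list computed, not typed
long <- toy[, lapply(.SD, function(x) c(mean(x), sd(x))),
            by = c("Year", "Treatment", "Greenhouse"),
            .SDcols = sp_cols]

# Label the two rows of each group
long[, stat := c("mean", "sd"), by = c("Year", "Treatment", "Greenhouse")]
```

.SDcols also accepts integer column positions, so something along the lines of the mydata[,5:75] idea from the question should work too, as long as the species columns really sit at those positions.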
Since you are interested in wide format, you can use dcast to reshape the data.
output <- dcast(output, Year + Treatment + Greenhouse ~ ...,
value.var = c("Sp_1", "Sp_2", "Sp_3"))
output
Year Treatment Greenhouse Sp_1_mean Sp_1_sd Sp_2_mean Sp_2_sd Sp_3_mean Sp_3_sd
1: 2010 A No 2.0 0.0000000 1.0 0.0000000 2.0 0.0000000
2: 2010 A Yes 4.0 0.0000000 7.0 0.0000000 8.0 0.0000000
3: 2010 B No 5.0 0.0000000 1.0 0.0000000 1.0 0.0000000
4: 2010 B Yes 0.0 0.0000000 0.0 0.0000000 2.0 0.0000000
5: 2011 A No 3.0 1.4142136 1.0 0.0000000 1.5 0.7071068
6: 2011 A Yes 2.0 2.8284271 6.5 0.7071068 7.0 4.2426407
7: 2011 B No 5.5 0.7071068 1.0 0.0000000 1.0 0.0000000
8: 2011 B Yes 0.0 0.0000000 0.0 0.0000000 2.0 0.0000000
With a slight modification of the aggregation, this long-to-wide conversion can be avoided.
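The answer does not show that modification, so the following is only one way it might look: have the function return a named list (mean, sd) per column and flatten one level with unlist(recursive = FALSE), which yields one column per species and statistic. The toy table is a four-row subset of the question's data; toy, sp_cols and wide are illustrative names, and the "Sp_" prefix is an assumption about the real column names.

```r
library(data.table)

# Four rows taken from the question's example (Treatment A, Greenhouse Yes)
toy <- data.table(
  Year       = c(2010L, 2010L, 2011L, 2011L),
  Replicate  = c(1L, 2L, 1L, 2L),
  Treatment  = factor(c("A", "A", "A", "A")),
  Greenhouse = factor(c("Yes", "Yes", "Yes", "Yes")),
  Sp_1       = c(4L, 4L, 0L, 4L),
  Sp_2       = c(7L, 7L, 7L, 6L)
)

# Assumption: species columns are exactly those whose names start with "Sp_"
sp_cols <- grep("^Sp_", names(toy), value = TRUE)

# Each column yields list(mean = ..., sd = ...); flattening one level gives
# a single list named Sp_1.mean, Sp_1.sd, Sp_2.mean, Sp_2.sd per group
wide <- toy[, unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))),
                     recursive = FALSE),
            by = .(Year, Treatment, Greenhouse),
            .SDcols = sp_cols]

# Swap the "." separator for "_" to match the requested column names
setnames(wide, gsub(".", "_", names(wide), fixed = TRUE))
wide
```

This produces one row per Year, Treatment and Greenhouse combination directly, so neither the stat column nor dcast is needed.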