Compute average of grouped data.frame using dplyr and tidyr
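I'm just learning R and am trying to find a way to modify my grouped data.frame so that, for the aggregated observations, I get the mean of the variable value, (x+y)/2, and the standard deviation (sd), sqrt((x^2+y^2)/2). The other (identical) variables (sequence, value1) should not change.
I used subset() and rowMeans(), but I wonder whether there is a better way using dplyr and tidyr (possibly with a nested data frame?).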
My test data.frame looks like:
id location value sd sequence value1
"anon1" "nose" 5 0.2 "a" 1
"anon2" "body" 4 0.4 "a" 2
"anon3" "left_arm" 3 0.3 "a" 3
"anon3" "right_arm" 5 0.6 "a" 3
"anon4" "head" 4 0.3 "a" 4
"anon5" "left_leg" 2 0.2 "a" 5
"anon5" "right_leg" 1 0.1 "a" 5
dput output of my test data.frame:
myData <- structure(list(id = structure(c(1L, 2L, 3L, 3L, 4L, 5L, 5L
), .Label = c("anon1", "anon2", "anon3", "anon4", "anon5"), class = "factor"),
location = structure(c(5L, 1L, 3L, 6L, 2L, 4L, 7L), .Label = c("body",
"head", "left_arm", "left_leg", "nose", "right_arm", "right_leg"
), class = "factor"), value = c(5L, 4L, 3L, 5L, 4L, 2L, 1L
), sd = c(0.2, 0.4, 0.3, 0.6, 0.3, 0.2, 0.1), sequence = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "a", class = "factor"),
value1 = c(1L, 2L, 3L, 3L, 4L, 5L, 5L)), .Names = c("id",
"location", "value", "sd", "sequence", "value1"), class = "data.frame", row.names = c(NA,
-7L))
What it should look like:
id location value sd sequence value1
"anon1" "nose" 5 0.2 "a" 1
"anon2" "body" 4 0.4 "a" 2
"anon3" "arm" 4 0.47 "a" 3
"anon4" "head" 4 0.3 "a" 4
"anon5" "leg" 1.5 0.15 "a" 5
dplyr's group_by and summarise will help, and gsub takes care of the string variable:
library(dplyr)
myData %>%
  group_by(id) %>%
  summarise(
    location = gsub(".*_", "", location[1]),
    value = mean(value),
    sd = mean(sd),
    sequence = sequence[1],
    value1 = value1[1]
  )
#> # A tibble: 5 × 6
#> id location value sd sequence value1
#> <fctr> <chr> <dbl> <dbl> <fctr> <int>
#> 1 anon1 nose 5.0 0.20 a 1
#> 2 anon2 body 4.0 0.40 a 2
#> 3 anon3 arm 4.0 0.45 a 3
#> 4 anon4 head 4.0 0.30 a 4
#> 5 anon5 leg 1.5 0.15 a 5
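Note that mean(sd) returns 0.45 for anon3, whereas the formula given in the question, sqrt((x^2+y^2)/2), gives about 0.47. The following is a minimal sketch, assuming the root mean square of the sd values is what is actually wanted. The data frame is rebuilt here with character columns instead of factors (an assumption made only to keep the example self-contained), and the name result is purely illustrative.

```r
library(dplyr)

# Rebuild the question's test data; character columns instead of
# factors are an assumption made for a self-contained example.
myData <- data.frame(
  id       = c("anon1", "anon2", "anon3", "anon3", "anon4", "anon5", "anon5"),
  location = c("nose", "body", "left_arm", "right_arm", "head",
               "left_leg", "right_leg"),
  value    = c(5, 4, 3, 5, 4, 2, 1),
  sd       = c(0.2, 0.4, 0.3, 0.6, 0.3, 0.2, 0.1),
  sequence = "a",
  value1   = c(1, 2, 3, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

result <- myData %>%
  group_by(id) %>%
  summarise(
    location = gsub(".*_", "", location[1]),  # "left_arm" becomes "arm"
    value    = mean(value),                   # (x + y) / 2
    sd       = sqrt(mean(sd^2)),              # sqrt((x^2 + y^2) / 2)
    sequence = sequence[1],
    value1   = value1[1]
  )

print(result)
```

For groups with a single row the root mean square equals the original sd, so those rows are unchanged. For anon5 it evaluates to about 0.158, slightly above the 0.15 shown in the expected table, which corresponds to the plain mean.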
Or, if id, sequence, and value1 match in all cases:
myData %>%
  group_by(id, sequence, value1) %>%
  summarise(
    location = gsub(".*_", "", location[1]),
    value = mean(value),
    sd = mean(sd))
#> Source: local data frame [5 x 6]
#> Groups: id, sequence [?]
#>
#> id sequence value1 location value sd
#> <fctr> <fctr> <int> <chr> <dbl> <dbl>
#> 1 anon1 a 1 nose 5.0 0.20
#> 2 anon2 a 2 body 4.0 0.40
#> 3 anon3 a 3 arm 4.0 0.45
#> 4 anon4 a 4 head 4.0 0.30
#> 5 anon5 a 5 leg 1.5 0.15
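The Groups: line in the printed output shows that the result of a multi-column group_by is still a grouped data frame. A minimal sketch, assuming a plain ungrouped result is preferred for later steps, appends ungroup() to the same pipeline. As before, the rebuilt myData (character columns rather than factors) and the name result are assumptions made only for illustration.

```r
library(dplyr)

# Rebuild the question's test data; character columns instead of
# factors are an assumption made for a self-contained example.
myData <- data.frame(
  id       = c("anon1", "anon2", "anon3", "anon3", "anon4", "anon5", "anon5"),
  location = c("nose", "body", "left_arm", "right_arm", "head",
               "left_leg", "right_leg"),
  value    = c(5, 4, 3, 5, 4, 2, 1),
  sd       = c(0.2, 0.4, 0.3, 0.6, 0.3, 0.2, 0.1),
  sequence = "a",
  value1   = c(1, 2, 3, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

result <- myData %>%
  group_by(id, sequence, value1) %>%
  summarise(
    location = gsub(".*_", "", location[1]),
    value    = mean(value),
    sd       = mean(sd)
  ) %>%
  ungroup()  # drop the remaining grouping (id, sequence)

print(result)
```

Without ungroup(), any later mutate() or summarise() would still operate per remaining group, which is easy to overlook.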