Use dplyr to calculate percentage and frequency of occurrence of two groups
I am learning dplyr and have searched similar posts for a solution, but found none that deal with this combination of problems.
Here is a sample data frame:
set.seed(1)
df <- data.frame(sampleID = c(rep("sample1", 2),
                              rep("sample2", 3),
                              rep("sample3", 4)),
                 species = c("clover", "nettle",
                             "clover", "nettle", "vine",
                             "clover", "clover", "nettle", "vine"),
                 type = c("vegetation", "seed",
                          "vegetation", "vegetation", "vegetation",
                          "seed", "vegetation", "seed", "vegetation"),
                 mass = sample(1:9))
> df
  sampleID species       type mass
1  sample1  clover vegetation    9
2  sample1  nettle       seed    4
3  sample2  clover vegetation    7
4  sample2  nettle vegetation    1
5  sample2    vine vegetation    2
6  sample3  clover       seed    6
7  sample3  clover vegetation    3
8  sample3  nettle       seed    8
9  sample3    vine vegetation    5
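I need to return a data frame that gives the percent mass of each unique species/type combination, and I also need the percent frequency with which each species/type combination occurs across the sampleIDs.
So in this example, the solution for the species/type combination vine/vegetation would be:
percent mass = (5+2)/sum(mass)
and the percent frequency would be 2/3, because the combination does not occur in sample1.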
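As a sanity check on those two target numbers, here is a minimal base R sketch of the arithmetic for vine/vegetation. The values are copied by hand from the printed data frame above rather than drawn with sample(), which is an assumption made so the numbers do not depend on the random number generator; the names is_vine_veg, perc_mass and perc_freq are only illustrative.

```r
# Columns copied from the printed data frame (mass is hard-coded,
# not drawn with sample(), so the result is reproducible).
sampleID <- c(rep("sample1", 2), rep("sample2", 3), rep("sample3", 4))
species  <- c("clover", "nettle", "clover", "nettle", "vine",
              "clover", "clover", "nettle", "vine")
type     <- c("vegetation", "seed", "vegetation", "vegetation", "vegetation",
              "seed", "vegetation", "seed", "vegetation")
mass     <- c(9, 4, 7, 1, 2, 6, 3, 8, 5)

# Rows belonging to the vine/vegetation combination
is_vine_veg <- species == "vine" & type == "vegetation"

# Percent mass: mass of the combination divided by the total mass, (5+2)/45
perc_mass <- sum(mass[is_vine_veg]) / sum(mass)

# Percent frequency: number of samples containing the combination
# divided by the number of distinct samples, 2/3
perc_freq <- length(unique(sampleID[is_vine_veg])) / length(unique(sampleID))
```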
To start, I tried different combinations such as:
df %>%
  group_by(species, type) %>%
  summarize(totmass = sum(mass)) %>%
  mutate(percmass = totmass / sum(totmass))
But this gives 100% mass for vine/vegetation? I also don't know where to go from there to get the percent frequency based on sampleID.
Not sure if I understood correctly, but maybe this is what you are looking for:
set.seed(1)
df <- data.frame(sampleID = c(rep("sample1", 2),
                              rep("sample2", 3),
                              rep("sample3", 4)),
                 species = c("clover", "nettle",
                             "clover", "nettle", "vine",
                             "clover", "clover", "nettle", "vine"),
                 type = c("vegetation", "seed",
                          "vegetation", "vegetation", "vegetation",
                          "seed", "vegetation", "seed", "vegetation"),
                 mass = sample(1:9))
library(dplyr)

df %>%
  # Add the total mass (sum of mass over all rows) to every row
  add_count(wt = mass, name = "sum_mass") %>%
  # Add the total number of distinct samples to every row
  mutate(nsamples = n_distinct(sampleID)) %>%
  # Include sum_mass and nsamples in group_by so they survive summarize
  group_by(species, type, sum_mass, nsamples) %>%
  # Per species/type: number of samples it occurs in and its total mass
  summarize(nsample = n_distinct(sampleID),
            totmass = sum(mass), .groups = "drop") %>%
  mutate(percmass = totmass / sum_mass,
         percfreq = nsample / nsamples)
#> # A tibble: 5 x 8
#>   species type       sum_mass nsamples nsample totmass percmass percfreq
#>   <chr>   <chr>         <int>    <int>   <int>   <int>    <dbl>    <dbl>
#> 1 clover  seed             45        3       1       6   0.133     0.333
#> 2 clover  vegetation       45        3       3      19   0.422     1
#> 3 nettle  seed             45        3       2      12   0.267     0.667
#> 4 nettle  vegetation       45        3       1       1   0.0222    0.333
#> 5 vine    vegetation       45        3       2       7   0.156     0.667
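As for why the attempt in the question reports 100% for vine/vegetation: by default, summarize() after group_by(species, type) drops only the last grouping level, so the result is still grouped by species and sum(totmass) becomes a per-species total. vine has a single type, so its share of its own species total is 1. The sketch below contrasts the two behaviours. It hard-codes the mass values from the printed data frame instead of calling sample() (an assumption made for reproducibility), and the names grouped and ungrouped are only illustrative.

```r
library(dplyr)

# Same data as in the question, with mass copied from the printed output
# so the result does not depend on the random number generator.
df <- data.frame(
  sampleID = c(rep("sample1", 2), rep("sample2", 3), rep("sample3", 4)),
  species  = c("clover", "nettle", "clover", "nettle", "vine",
               "clover", "clover", "nettle", "vine"),
  type     = c("vegetation", "seed", "vegetation", "vegetation", "vegetation",
               "seed", "vegetation", "seed", "vegetation"),
  mass     = c(9L, 4L, 7L, 1L, 2L, 6L, 3L, 8L, 5L)
)

# Attempt from the question: the summarized result is still grouped by
# species, so sum(totmass) is computed within each species.
grouped <- df %>%
  group_by(species, type) %>%
  summarize(totmass = sum(mass)) %>%
  mutate(percmass = totmass / sum(totmass))

# Dropping all groups makes sum(totmass) the grand total over all rows.
ungrouped <- df %>%
  group_by(species, type) %>%
  summarize(totmass = sum(mass), .groups = "drop") %>%
  mutate(percmass = totmass / sum(totmass))
```

With the groups dropped, percmass for vine/vegetation is 7/45, matching the percmass column above. Calling ungroup() after summarize() would have the same effect as .groups = "drop".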