Find percentage over multiple subgroups
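I have some data on countries over several years, which I group by two columns: year and country.
Now I want to compute the proportion (or percentage) for this summarised table, but only within each year. In other words, the sum should only include rows that share the same year. How can I do that?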
Here is the naive approach:
library(dplyr)
df %>%
  group_by(start_year, sending_country_code) %>%
  summarise(cnt = n()) %>%
  mutate(perc = round(100 * cnt / sum(cnt), 2))
But I am not sure whether sum(cnt) sums over the entire cnt column or does exactly what I want (summing within each year).
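You can get the dataset with:
library(tidyr)  # separate()
tuesdata <- tidytuesdayR::tt_load(2022, week = 10)
erasmus <- tuesdata$erasmus
# Split academic_year on "-" into start_year and year2, then drop year2
erasmus <- erasmus %>%
  separate(academic_year, c("start_year", "year2"), "-", convert = TRUE) %>%
  select(-year2)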
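I found that your code is already correct: summarise() drops only the last grouping variable, so its output is still grouped by start_year and sum(cnt) is computed within each year. Here are the two steps I used to make sure: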
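- Find the total for each year
library(purrr)  # map_dbl(); pivot_wider() comes from tidyr
sums <- erasmus %>%
  group_by(start_year, sending_country_code) %>%
  summarise(cnt = n()) %>%
  # one column per year, one row per country
  pivot_wider(names_from = start_year, values_from = cnt) %>%
  select(contains('20')) %>%
  # column sums give the per-year totals
  map_dbl(sum, na.rm = TRUE)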
#`summarise()` has grouped output by 'start_year'. You can override using the `.groups` argument.
sums
# 2014 2015 2016 2017 2018 2019
# 4966 26565 33200 33261 33645 32998
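- Go back to the original data and compute the percentages from those totals
erasmus %>%
  group_by(start_year, sending_country_code) %>%
  summarise(cnt = n()) %>%
  # pick the total whose name matches the current group's year
  mutate(perc = round(100 * cnt / sums[names(sums) %in% start_year], 2))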
#`summarise()` has grouped output by 'start_year'. You can override using the `.groups` argument.
# A tibble: 290 × 4
# Groups: start_year [6]
# start_year sending_country_code cnt perc
# <int> <chr> <int> <dbl>
# 1 2014 AL 2 0.04
# 2 2014 AM 3 0.06
# 3 2014 AT 91 1.83
# 4 2014 BA 2 0.04
# 5 2014 BE 22 0.44
# 6 2014 BG 133 2.68
# 7 2014 CY 25 0.5
# 8 2014 CZ 125 2.52
# 9 2014 DE 646 13.0
#10 2014 DK 27 0.54
# … with 280 more rows
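The same check can be run without downloading the dataset. The sketch below uses a small made-up table (toy and its values are hypothetical, not taken from the Erasmus data) to show that the output of summarise() is still grouped by start_year, so sum(cnt) inside mutate() is a per-year total.

```r
library(dplyr)

# Hypothetical toy data: one row per exchange, two years, three countries.
toy <- tibble(
  start_year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015),
  sending_country_code = c("AT", "AT", "DE", "BE", "AT", "DE", "DE", "DE", "BE")
)

res <- toy %>%
  group_by(start_year, sending_country_code) %>%
  summarise(cnt = n()) %>%   # by default only the last grouping level is dropped
  mutate(perc = round(100 * cnt / sum(cnt), 2))

# Remaining grouping variable(s) of the summarised table.
group_vars(res)

# Sum of the percentages within each year; each should be (about) 100.
check <- res %>% summarise(total = sum(perc))
check
```

If sum(cnt) ran over the whole column instead, the per-year totals in check would fall well short of 100.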
Edit
Using %>% and %T>%, the two steps above can be combined into a single pipe chain:
library(magrittr)  # %T>% (tee pipe); the \(x) lambda needs R >= 4.1
erasmus %>%
  group_by(start_year, sending_country_code) %>%
  summarise(cnt = n()) %T>%
  # side effect: store the per-year totals in the global variable sums
  {\(x) sums <<- x %>%
    pivot_wider(names_from = start_year, values_from = cnt) %>%
    select(contains('20')) %>%
    map_dbl(sum, na.rm = TRUE)
  }() %>%
  mutate(perc = round(100 * cnt / sums[names(sums) %in% start_year], 2))
# `summarise()` has grouped output by 'start_year'. You can override using the
# `.groups` argument.
# # A tibble: 290 × 4
# # Groups: start_year [6]
# start_year sending_country_code cnt perc
# <int> <chr> <int> <dbl>
# 1 2014 AL 2 0.04
# 2 2014 AM 3 0.06
# 3 2014 AT 91 1.83
# 4 2014 BA 2 0.04
# 5 2014 BE 22 0.44
# 6 2014 BG 133 2.68
# 7 2014 CY 25 0.5
# 8 2014 CZ 125 2.52
# 9 2014 DE 646 13.0
# 10 2014 DK 27 0.54
# # … with 280 more rows
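If relying on the implicit grouping or on a global assignment feels fragile, one possible alternative is to make the per-year grouping explicit. This is only a sketch on hypothetical toy data (toy is made up for illustration), not the approach used above: count() returns an ungrouped table here, and group_by(start_year) then states exactly which subset sum(cnt) runs over.

```r
library(dplyr)

# Hypothetical toy data: one row per exchange, two years, three countries.
toy <- tibble(
  start_year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2015),
  sending_country_code = c("AT", "AT", "DE", "BE", "AT", "DE", "DE", "DE", "BE")
)

res2 <- toy %>%
  count(start_year, sending_country_code, name = "cnt") %>%  # ungrouped counts
  group_by(start_year) %>%                                   # explicit per-year grouping
  mutate(perc = round(100 * cnt / sum(cnt), 2)) %>%
  ungroup()                                                  # avoid surprises downstream

res2
```

The trailing ungroup() is a design choice: later steps then operate on the whole table unless they regroup explicitly.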