Find percentage over multiple subgroups

I have some data on countries over several years, which I group by two columns: year and country. This is what it looks like:

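Now I want to compute proportions (or percentages) for this summarised table, but only within each year. In other words, the sum should only run over rows that share the same year. How can I do that?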

Here is the naive approach:

library(dplyr)

df %>% 
  group_by(start_year, sending_country_code) %>% 
  summarise(cnt = n()) %>% 
  mutate(perc = round(100 * cnt / sum(cnt), 2))

But I'm not sure whether the sum(cnt) call sums over the entire cnt column or does exactly what I want (sums within each year).


You can get the dataset with:
tuesdata <- tidytuesdayR::tt_load(2022, week = 10)
erasmus <- tuesdata$erasmus

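library(tidyr)  # for separate()

# split "2014-2015" into two integer columns and keep only the start year
erasmus <- erasmus %>%
  separate(academic_year, c("start_year", "year2"), "-", convert = TRUE) %>%
  select(-year2)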

I found that your code is already correct. Here are the two steps I took to make sure:

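  1. Compute the total for each year
library(tidyr)  # pivot_wider()
library(purrr)  # map_dbl()

sums <- erasmus %>%
  group_by(start_year, sending_country_code) %>%
  summarise(cnt = n()) %>%
  pivot_wider(names_from = start_year, values_from = cnt) %>%
  select(contains('20')) %>%    # keep only the year columns
  map_dbl(sum, na.rm = TRUE)    # column sums = total count per year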
#`summarise()` has grouped output by 'start_year'. You can override using the `.groups` argument.

 sums
# 2014  2015  2016  2017  2018  2019 
# 4966 26565 33200 33261 33645 32998 
  2. Go back to the original data and compute the percentages from those totals
erasmus %>% 
     group_by(start_year, sending_country_code) %>% 
     summarise(cnt = n()) %>% 
     mutate(perc = round(100*cnt/sums[names(sums) %in% start_year], 2))
#`summarise()` has grouped output by 'start_year'. You can override using the `.groups` argument.
# A tibble: 290 × 4
# Groups:   start_year [6]
#   start_year sending_country_code   cnt  perc
#        <int> <chr>                <int> <dbl>
# 1       2014 AL                       2  0.04
# 2       2014 AM                       3  0.06
# 3       2014 AT                      91  1.83
# 4       2014 BA                       2  0.04
# 5       2014 BE                      22  0.44
# 6       2014 BG                     133  2.68
# 7       2014 CY                      25  0.5 
# 8       2014 CZ                     125  2.52
# 9       2014 DE                     646 13.0 
#10       2014 DK                      27  0.54
# … with 280 more rows
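
For a quick sanity check without downloading anything, the same behaviour can be reproduced on a tiny made-up table. This is only a sketch: `toy` and its values are hypothetical, not the erasmus data. It relies on summarise() dropping just the last grouping variable by default, which is what the "grouped output by 'start_year'" message is telling you.

```r
library(dplyr)

# Hypothetical toy data standing in for the erasmus table
toy <- tibble(
  start_year = c(2014L, 2014L, 2014L, 2015L, 2015L),
  sending_country_code = c("AT", "AT", "DE", "AT", "DE")
)

res <- toy %>%
  group_by(start_year, sending_country_code) %>%
  summarise(cnt = n()) %>%
  mutate(perc = round(100 * cnt / sum(cnt), 2))

# Which grouping is left after summarise()? Only the first variable.
group_vars(res)

# So sum(cnt) ran per year, and the percentages add up to about 100 within each year
res %>% summarise(total = sum(perc))
```

If sum(cnt) had run over the whole column, the per-year totals in the last call would be well below 100.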

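Edit: by using %>% together with %T>% (the tee pipe from magrittr, so library(magrittr) is needed), the two steps above can be turned into a single pipe chain: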

erasmus %>% 
    group_by(start_year, sending_country_code) %>% 
    summarise(cnt = n()) %T>% 
    {\(x) sums <<- x %>%
          pivot_wider(names_from = start_year, values_from = cnt) %>% 
          select(contains('20'))%>% 
          map_dbl(sum, na.rm = TRUE)
    }() %>% 
    mutate(perc = round(100*cnt/sums[names(sums) %in% start_year], 2))
    

#`summarise()` has grouped output by 'start_year'. You can override using the `.groups` argument.
# A tibble: 290 × 4
# Groups:   start_year [6]
#   start_year sending_country_code   cnt  perc
#        <int> <chr>                <int> <dbl>
# 1       2014 AL                       2  0.04
# 2       2014 AM                       3  0.06
# 3       2014 AT                      91  1.83
# 4       2014 BA                       2  0.04
# 5       2014 BE                      22  0.44
# 6       2014 BG                     133  2.68
# 7       2014 CY                      25  0.5 
# 8       2014 CZ                     125  2.52
# 9       2014 DE                     646 13.0 
#10       2014 DK                      27  0.54
# … with 280 more rows
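
If you would rather not depend on the grouping that summarise() leaves behind, one possible way to spell out the intent is to count first and then group by year explicitly. A minimal sketch on made-up data (`toy` is hypothetical; swap in `erasmus` for the real thing):

```r
library(dplyr)

# Hypothetical toy data standing in for the erasmus table
toy <- tibble(
  start_year = c(2014L, 2014L, 2014L, 2015L, 2015L),
  sending_country_code = c("AT", "AT", "DE", "AT", "DE")
)

res <- toy %>%
  count(start_year, sending_country_code, name = "cnt") %>%  # ungrouped counts per year/country
  group_by(start_year) %>%                                    # the denominator is the year total
  mutate(perc = round(100 * cnt / sum(cnt), 2)) %>%
  ungroup()                                                   # leave no grouping behind

res
```

This should give the same percentages as the original pipeline, just without the message about grouped output. Passing `.groups = "drop_last"` to summarise() in the original code would be another way to state the same thing.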