Calculate the percent occurrence of a variable in multiple groups

Sample data

set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35), year = rep(1980:2014, times = 1000), month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))

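The data frame holds 1000 locations x 35 years of data for a variable called month.id, which is essentially the month of the year. For each year, I want to calculate the percent occurrence of each month. For example, for 1980: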

month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1   2   3   4   8   9  10  12 
106 132 116 122 114 130 141 139 

To calculate the percent occurrence of each month:

table(month.vec$month.id)/length(month.vec$month.id) * 100
1    2    3    4    8    9   10   12 
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9 
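
As a side note, the division by the vector length can also be written with base R's prop.table(). The sketch below uses a small made-up vector (month_id is a hypothetical stand-in for month.vec$month.id, not the data above) just to show that the two forms agree:

```r
# Toy vector standing in for month.vec$month.id (illustrative values only)
month_id <- c(1, 1, 2, 3, 3, 3, 8, 12)

# Manual approach from the question: counts divided by the total, times 100
pct_manual <- table(month_id) / length(month_id) * 100

# prop.table() performs the same normalisation in one step
pct_prop <- prop.table(table(month_id)) * 100

# Both give the same percentages
all.equal(as.numeric(pct_manual), as.numeric(pct_prop))
```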

I want a table something like this:

    year month percent
    1980   1    10.6
    1980   2    13.2
    1980   3    11.6
    1980   4    12.2
    1980   5    NA
    1980   6    NA
    1980   7    NA
    1980   8    11.4
    1980   9    13.0
    1980   10   14.1
    1980   11   NA
    1980   12   13.9

Since months 5, 6, 7 and 11 are missing, I just want to add extra rows with NA for those months. If possible, I would like a dplyr solution along these lines:

   library(dplyr)
   df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)  

Solution using dplyr and tidyr

# To get month as integer use (or add as.integer to mutate):
# df$month.id <- as.integer(df$month.id)

library(dplyr)
library(tidyr)
df %>%
    group_by(year, month.id) %>% 
    # Count occurrences per year & month
    summarise(n = n()) %>%
    # Get percent per month (year number is calculated with sum(n))
    mutate(percent = n / sum(n) * 100) %>%
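    # Fill in missing months (the data is still grouped by year here, so
    # completion happens within each year; recent tidyr versions do not
    # allow completing on the grouping column itself)
    complete(month.id = 1:12, fill = list(percent = 0)) %>%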
    select(year, month.id, percent)
    year month.id percent
   <int>    <dbl>   <dbl>
 1  1980     1.00    10.6
 2  1980     2.00    13.2
 3  1980     3.00    11.6
 4  1980     4.00    12.2
 5  1980     5.00     0  
 6  1980     6.00     0  
 7  1980     7.00     0  
 8  1980     8.00    11.4
 9  1980     9.00    13.0
10  1980    10.0     14.1
11  1980    11.0      0  
12  1980    12.0     13.9
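
The output above reports 0 for the missing months, whereas the question asked for NA. One possible variation, sketched here under the assumption that leaving out the fill argument is acceptable, is to ungroup before calling complete() so the absent months simply come back as NA. The name res and the renamed month column are my own choices for illustration:

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
set.seed(123)
df <- data.frame(
  loc.id   = rep(1:1000, each = 35),
  year     = rep(1980:2014, times = 1000),
  month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE)
)

res <- df %>%
  count(year, month.id) %>%                # one row per observed year/month
  group_by(year) %>%
  mutate(percent = n / sum(n) * 100) %>%   # share of each month within its year
  ungroup() %>%
  complete(year, month.id = 1:12) %>%      # absent months get NA (no fill given)
  select(year, month = month.id, percent)

head(res, 12)
```

Because no fill value is supplied, months 5, 6, 7 and 11 carry NA in percent, which matches the table requested in the question.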

Base R solution:

tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12))/length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)

This gives:

> dfnew
   Var1 Var2 Freq
1  1980    1 10.6
2  1980    2 13.2
3  1980    3 11.6
4  1980    4 12.2
5  1980    5  0.0
6  1980    6  0.0
7  1980    7  0.0
8  1980    8 11.4
9  1980    9 13.0
10 1980   10 14.1
11 1980   11  0.0
12 1980   12 13.9
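
The snippet above only covers 1980, since it works on month.vec. A rough extension to all years, assuming prop.table() with margin = 1 is an acceptable way to normalise within each year, might look like this (the names counts, pct and out are hypothetical, and zeros are turned into NA to match the table in the question):

```r
# Same sample data as in the question
set.seed(123)
df <- data.frame(
  loc.id   = rep(1:1000, each = 35),
  year     = rep(1980:2014, times = 1000),
  month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE)
)

# Fixing the factor levels to 1:12 keeps months that never occur as zero counts
counts <- table(df$year, factor(df$month.id, levels = 1:12))

# margin = 1 normalises each row (one row per year) separately
pct <- prop.table(counts, margin = 1) * 100

# Long format with readable column names
out <- as.data.frame(pct, stringsAsFactors = FALSE)
names(out) <- c("year", "month", "percent")
out$year  <- as.integer(out$year)
out$month <- as.integer(out$month)

# Months with no observations: report NA instead of 0
out$percent[out$percent == 0] <- NA

out <- out[order(out$year, out$month), ]
head(out, 12)
```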

data.table:

library(data.table)

setDT(month.vec)[, .N, by = .(year, month.id)
                 ][.(year = 1980, month.id = 1:12), on = .(year, month.id)
                   ][, N := 100 * N/sum(N, na.rm = TRUE)][]
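
The same idea can be stretched to every year at once. This is only a sketch: it swaps the single-year lookup table for CJ(), which builds all year/month combinations, and computes the percentage by year. The names dt, counts, full and the percent column are assumptions for illustration:

```r
library(data.table)

# Same sample data as in the question, built directly as a data.table
set.seed(123)
dt <- data.table(
  loc.id   = rep(1:1000, each = 35),
  year     = rep(1980:2014, times = 1000),
  month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE)
)

# Count observations per year and month
counts <- dt[, .N, by = .(year, month.id)]

# CJ() creates every year/month pair; the join leaves N as NA where nothing matched
full <- counts[CJ(year = 1980:2014, month.id = 1:12), on = .(year, month.id)]

# Percentage within each year; NA counts stay NA
full[, percent := 100 * N / sum(N, na.rm = TRUE), by = year]

head(full, 12)
```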