Prevalence Estimates from Observations in data.table Containing Many Binary Classification Columns

I am estimating prevalences from a raw data.table by brute force, and I need to make this more efficient. Can you help?

My data.table contains one weighted observation per row. Many columns act as binary dummy variables indicating whether a given observation belongs to one or more of many possible classifications. (For example, a story can be 'amazing', 'boring', or 'charming', or any combination of the three.)

There must be a data.table approach that can replace my for loop. I also suspect that I may not actually need to generate the queries set at all. I would appreciate a fresh perspective on this problem.

library(data.table)

set.seed(42)
# I have many weighted observations that can be labeled as belonging to one of many categories
# in this example, I simulate 10 observations and only 3 categories
dt = data.table(
        weight = runif( n = 10 , min = 0, max = 1 ),
        a = sample( x = c(0,1) , size = 10 , replace = TRUE ),
        b = sample( x = c(0,1) , size = 10 , replace = TRUE ),
        c = sample( x = c(0,1) , size = 10 , replace = TRUE )
)

# Generate all combinations of categories
queries = as.data.table( expand.grid( rep( list(0:1) , length(names(dt))-1 ) ) )
names(queries) = names(dt)[ 2:length(names(dt)) ] # rename Var1, Var2, Var3 to a, b, c

# Brute force through each possible combination to calculate prevalence
prevalence = rep( NA, nrow(queries) )
for( q in 1:nrow(queries) ){
    prevalence[q] = dt[ a == queries[q, a] & b == queries[q, b] & c == queries[q, c] , sum(weight) ] / dt[ , sum(weight) ]
}

results = copy(queries)
results$prevalence = prevalence

results

The output is:

#   a b c prevalence
#1: 0 0 0 0.09771385
#2: 1 0 0 0.10105192
#3: 0 1 0 0.36229784
#4: 1 1 0 0.00000000
#5: 0 0 1 0.00000000
#6: 1 0 1 0.05993197
#7: 0 1 1 0.00000000
#8: 1 1 1 0.37900443

Updated: The original question had 42 simulated observations and the data covered each possible combination of categories (a, b, c). The question was revised to only include 10 simulated observations so there would be combinations with no observations (and zero prevalence).

Updated Answer

Method 1:

  1. Use CJ to create the complete set of combinations of a, b, c, then join it to dt.
  2. Sum weight within each group, then divide by total_weight.
  3. The resulting NAs are expected: those combinations have no observations. If needed, you can fill them with 0 via the nafill function; see the sketch after the output below.

total_weight = sum(dt$weight)
dt[CJ(a, b, c, unique = TRUE), on = .(a, b, c)][
   , .(prevalence = sum(weight) / total_weight), by = .(a, b, c)]

#      a     b     c prevalence
#   <num> <num> <num>      <num>
#1:     0     0     0 0.09771385
#2:     0     0     1         NA
#3:     0     1     0 0.36229784
#4:     0     1     1         NA
#5:     1     0     0 0.10105192
#6:     1     0     1 0.05993197
#7:     1     1     0         NA
#8:     1     1     1 0.37900443
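
If you want the empty combinations reported as 0 rather than NA (matching the question's brute-force output), here is a minimal sketch using nafill; res is just an illustrative name for the chained result above:

res = dt[CJ(a, b, c, unique = TRUE), on = .(a, b, c)][
        , .(prevalence = sum(weight) / total_weight), by = .(a, b, c)]
# combinations with no observations have NA prevalence; replace them with 0
res[, prevalence := nafill(prevalence, fill = 0)]
res[]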

Method 2:

dt2 = dt[, .(prevalence = sum(weight) / total_weight), by = .(a, b, c)]
dt2[queries, on = .(a, b, c)]
# or, to fill the unmatched combinations with 0:
# queries[, prevalence := fcoalesce(dt2[queries, on = .(a, b, c), prevalence], 0)]
#       a     b     c prevalence
#   <int> <int> <int>      <num>
#1:     0     0     0 0.09771385
#2:     1     0     0 0.10105192
#3:     0     1     0 0.36229784
#4:     1     1     0         NA
#5:     0     0     1         NA
#6:     1     0     1 0.05993197
#7:     0     1     1         NA
#8:     1     1     1 0.37900443
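
The same result can also be obtained in a single join-and-aggregate, without the intermediate dt2, using by = .EACHI (a sketch reusing queries from the question and total_weight from Method 1):

dt[queries, .(prevalence = sum(weight) / total_weight),
   on = .(a, b, c), by = .EACHI]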

Original Answer

You can compute this with a grouped aggregation:

dt[,.( prevalence = sum(weight) / dt[,sum(weight)] ), by = .(a,b,c)]
  • Each group corresponds to one of your category combinations
  • Sum the weight in each group, then divide by the total weight
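
Note that the one-liner above re-evaluates dt[, sum(weight)] once per group. A small variation (tot is just an illustrative name) that computes the denominator once:

tot = dt[, sum(weight)]
dt[, .(prevalence = sum(weight) / tot), by = .(a, b, c)]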

Another answer

Here are a couple of solutions (in both cases, you can replace the keyby argument with by). Note that the outputs below were produced from the original question's 42 simulated observations, which covered every combination of the categories.

If your dataset (dt) already contains all possible combinations of the different categories, then you can do the following:

dt[, .(prevalence = sum(weight)/sum(dt$weight)), keyby=.(a, b, c)]

#        a     b     c prevalence
# 1:     0     0     0 0.10876301
# 2:     0     0     1 0.02135357
# 3:     0     1     0 0.03775363
# 4:     0     1     1 0.12806864
# 5:     1     0     0 0.18204696
# 6:     1     0     1 0.15197811
# 7:     1     1     0 0.25629705
# 8:     1     1     1 0.11373903

Conversely, if the dataset does not contain all possible combinations of the different categories, you can solve it as follows (CJ(a, b, c, unique=TRUE) builds all the combinations and removes duplicates):

dt[CJ(a, b, c, unique=TRUE), .(prevalence = sum(weight)/sum(dt$weight)), keyby=.(a, b, c), on=.(a, b, c)]

#        a     b     c prevalence
# 1:     0     0     0 0.10876301
# 2:     0     0     1 0.02135357
# 3:     0     1     0 0.03775363
# 4:     0     1     1 0.12806864
# 5:     1     0     0 0.18204696
# 6:     1     0     1 0.15197811
# 7:     1     1     0 0.25629705
# 8:     1     1     1 0.11373903
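
Since the real data has many more dummy columns than the three simulated here, a sketch of the same CJ-based approach without hard-coding a, b, c (it assumes every column other than weight is a 0/1 category column; cats and combos are illustrative names):

cats = setdiff(names(dt), "weight")                            # all category columns
combos = do.call(CJ, c(as.list(dt[, ..cats]), unique = TRUE))  # grid of observed value combinations
dt[combos, .(prevalence = sum(weight) / sum(dt$weight)),
   keyby = cats, on = cats]

Note that CJ here uses only the values observed in each column; if a dummy column could be all 0s or all 1s in the sample, build the grid from rep(list(0:1), length(cats)) instead, as in the question's queries table.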