data.table: Applying a function to rows of grouped columns (could also be called collapsing columns)

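I am looking for a more elegant way to apply a function (e.g. sum) to each row across groups of columns. I currently do this by transposing and collapsing the columns, but that is computationally expensive for large datasets.

Here is my example data: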

library(data.table)
data <- data.table(C1=rep(1,5),C2=rep(2,5),C3=rep(1,5),C4=rep(2,5))
data
#>   C1 C2 C3 C4
#> 1:  1  2  1  2
#> 2:  1  2  1  2
#> 3:  1  2  1  2
#> 4:  1  2  1  2
#> 5:  1  2  1  2
group <- data.table(Sample=c("C1","C2","C3","C4"),Group = c("X","Y","X","Y"))
group
#>   Sample Group
#> 1:     C1     X
#> 2:     C2     Y
#> 3:     C3     X
#> 4:     C4     Y

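I simply want to add "C1" and "C3" (group X) together and "C2" and "C4" (group Y) together, and use the group names as the column names. This is what I want to end up with: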

   X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4

Here is my solution:

data <- data.table(C1=rep(1,5),C2=rep(2,5),C3=rep(1,5),C4=rep(2,5))
group <- data.table(Sample=c("C1","C2","C3","C4"),Group = c("X","Y","X","Y"))

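# Transpose so that each original column becomes a row
data <- transpose(data)
# Sum the rows (the former columns) within each group
data <- data[,lapply(.SD,sum),by=list(group$Group)]
# Transpose back, using the grouping column as the new column names
data <- transpose(data,make.names = "group")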
data
#>   X Y
#> 1: 2 4
#> 2: 2 4
#> 3: 2 4
#> 4: 2 4
#> 5: 2 4

It works, but I believe there is a better way. Transposing twice is very expensive for large matrices.

If the columns of 'data' are in the same order as the rows of 'group', use split.default

setDT(lapply(split.default(data, group$Group), rowSums))[]

-output

   X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4
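
To make the one-liner above easier to follow, here is a minimal sketch that spells out its intermediate steps on the question's example data. The variable names `parts`, `col_names`, `sums` and `result` are only illustrative and do not appear in the original answer.

```r
library(data.table)

data  <- data.table(C1 = rep(1, 5), C2 = rep(2, 5), C3 = rep(1, 5), C4 = rep(2, 5))
group <- data.table(Sample = c("C1", "C2", "C3", "C4"), Group = c("X", "Y", "X", "Y"))

# split.default() treats the table as a list of columns, so the
# grouping vector must have one entry per column, in column order.
parts <- split.default(data, group$Group)

# Each list element holds the columns that belong to one group.
col_names <- lapply(parts, names)

# rowSums() collapses each block of columns into a single vector.
sums <- lapply(parts, rowSums)

# The named list of vectors becomes a table with one column per group.
result <- as.data.table(sums)
result
```

Because the split is purely positional, this version gives wrong groupings whenever the column order and the order of `group$Sample` differ.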

If the column names are in a different order, match them using a named vector

nm1 <- setNames(group$Group, group$Sample)[colnames(data)]
setDT(lapply(split.default(data, nm1), rowSums))[]
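
As a small illustration of why the named-vector lookup helps, the sketch below shuffles the columns of the example data on purpose. The shuffled column order and the name `lookup` are assumptions made for the demonstration only.

```r
library(data.table)

# Same values as in the question, but with the columns deliberately shuffled.
data  <- data.table(C4 = rep(2, 5), C1 = rep(1, 5), C3 = rep(1, 5), C2 = rep(2, 5))
group <- data.table(Sample = c("C1", "C2", "C3", "C4"), Group = c("X", "Y", "X", "Y"))

# Named vector: names are the sample (column) names, values are the groups.
lookup <- setNames(group$Group, group$Sample)

# Indexing by column name returns the group of each column in the
# order in which the columns actually appear in 'data'.
nm1 <- lookup[colnames(data)]
nm1

res <- setDT(lapply(split.default(data, nm1), rowSums))[]
res
```

Using `group$Group` directly here would pair C4 with X and C1 with Y, so the lookup step is what keeps the result correct.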

Alternatively, split the 'group' data, loop over the resulting list, extract the corresponding columns, and then apply rowSums

setDT(lapply(split(group$Sample, group$Group),
       function(x) rowSums(data[, ..x])))[]
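
If this is needed in several places, the same idea could be wrapped in a small function. The helper below, `collapse_cols`, is hypothetical (it is not part of data.table) and assumes the mapping table has columns named `Sample` and `Group` as in the question.

```r
library(data.table)

data  <- data.table(C1 = rep(1, 5), C2 = rep(2, 5), C3 = rep(1, 5), C4 = rep(2, 5))
group <- data.table(Sample = c("C1", "C2", "C3", "C4"), Group = c("X", "Y", "X", "Y"))

# Hypothetical helper: collapse the columns of 'dt' by the groups in 'map'.
# 'fun' must be a row-wise function such as rowSums or rowMeans.
collapse_cols <- function(dt, map, fun = rowSums) {
  # Fail early if the mapping mentions a column that 'dt' does not have.
  stopifnot(all(map$Sample %in% names(dt)))
  # Column names per group, e.g. list(X = c("C1", "C3"), Y = c("C2", "C4")).
  cols_by_group <- split(map$Sample, map$Group)
  # Select each group's columns by name and collapse them row-wise.
  as.data.table(lapply(cols_by_group, function(cols) fun(dt[, cols, with = FALSE])))
}

res_sum  <- collapse_cols(data, group)
res_mean <- collapse_cols(data, group, fun = rowMeans)
res_sum
```

Since the columns are selected by name, the result does not depend on the column order of `data`.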

Benchmarks

set.seed(24)
data_test <- as.data.table(matrix(rnorm(5000 * 5000), ncol = 5000, dimnames = list(NULL, paste0("C", 1:5000))))

group_test <- data.table(Sample= paste0("C", 1:5000),Group = rep(LETTERS[1:10], 500) )

system.time({
  nm1 <- setNames(group_test$Group, group_test$Sample)[colnames(data_test)]
  setDT(lapply(split.default(data_test, nm1), rowSums))[]
})
#   user  system elapsed 
#  0.167   0.048   0.219 


system.time({
  long <- melt(data_test[, rn := .I], "rn")
  dcast(long[group_test, on = "variable==Sample"], rn ~ Group, sum)
})
#   user  system elapsed 
#  2.897   0.305   3.189 
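
Before comparing timings, it can be worth confirming that both approaches return the same numbers. The sketch below does this on a much smaller random table. The sizes (20 rows, 30 columns, 3 groups) and the names `small`, `small_group`, `res_split`, `res_long` and `same` are arbitrary choices for illustration.

```r
library(data.table)

set.seed(24)
small <- as.data.table(matrix(rnorm(20 * 30), ncol = 30,
                              dimnames = list(NULL, paste0("C", 1:30))))
small_group <- data.table(Sample = paste0("C", 1:30),
                          Group  = rep(LETTERS[1:3], 10))

# Approach 1: split the columns by group and take row sums.
nm1 <- setNames(small_group$Group, small_group$Sample)[colnames(small)]
res_split <- setDT(lapply(split.default(small, nm1), rowSums))[]

# Approach 2: melt to long format, join the groups, cast back with sum.
# copy() keeps the rn column out of the original table.
long <- melt(copy(small)[, rn := .I], id.vars = "rn")
res_long <- dcast(long[small_group, on = "variable==Sample"],
                  rn ~ Group, fun.aggregate = sum, value.var = "value")
res_long[, rn := NULL]

# Compare as plain matrices; all.equal() allows for floating-point noise.
same <- isTRUE(all.equal(as.matrix(res_split), as.matrix(res_long)))
same
```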

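Perhaps it is worth considering a completely different storage format for data.

Reshaping data to long format allows the column names to be treated as data items and joined with group.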

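# rn := .I adds a row-number column to data by reference
long <- melt(data[, rn := .I], "rn")
# join the group labels onto the long data, then sum per row and group
dcast(long[group, on = "variable==Sample"], rn ~ Group, sum)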
   rn X Y
1:  1 2 4
2:  2 2 4
3:  3 2 4
4:  4 2 4
5:  5 2 4
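
A more explicit variant of the same long-format idea is sketched below, under the assumption that the original `data` should stay unmodified. It works on a copy, names the melted columns directly, and aggregates before casting. The intermediate names `sums` and `wide` and the column name `total` are illustrative only.

```r
library(data.table)

data  <- data.table(C1 = rep(1, 5), C2 = rep(2, 5), C3 = rep(1, 5), C4 = rep(2, 5))
group <- data.table(Sample = c("C1", "C2", "C3", "C4"), Group = c("X", "Y", "X", "Y"))

# Work on a copy so that the rn column is not added to 'data' itself.
long <- melt(copy(data)[, rn := .I], id.vars = "rn",
             variable.name = "Sample", value.name = "value",
             variable.factor = FALSE)

# Attach the group label to every (row, sample) pair.
long <- merge(long, group, by = "Sample")

# Sum within each original row and group.
sums <- long[, .(total = sum(value)), by = .(rn, Group)]

# Cast back to one column per group.
wide <- dcast(sums, rn ~ Group, value.var = "total")
wide
```

If the data were stored in this long format from the start, only the join and the grouped sum would be needed.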

Here is another solution you could use:

library(dplyr)
library(tidyr)
library(rlang)

group %>%
  group_by(Group) %>%
  summarise(Sum = eval_tidy(parse_expr(paste0(Sample, collapse = "+")), data = data)) %>%
  mutate(id = row_number()) %>%
  pivot_wider(names_from = Group, values_from = Sum)

# A tibble: 5 x 3
     id     X     Y
  <int> <dbl> <dbl>
1     1     2     4
2     2     2     4
3     3     2     4
4     4     2     4
5     5     2     4