How to use a for loop to use ddply on multiple columns?

Question

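I found some very similar Stack Overflow questions, but the answers aren't what I'm after (for example, "Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)").

The main difference is that those answers simplify the problem by using aggregate (or similar) instead of a for loop (or apply). However, I already have a large chunk of code that runs smoothly and does various summaries, statistics and plots, so what I really want is to get the loop or function working. The problem I'm facing is getting from the column name stored as q in the loop to the actual column (get() doesn't work for me). See below.

My dataset is similar to the one below, but with 40 features: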

Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = T)
Feature2 <- sample(400:500, 12, replace = T)
Feature3 <- sample(1:5, 12, replace = T)
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2, 
Feature3, stringsAsFactors = FALSE)

My attempt so far uses a for loop:

Feat <- c(colnames(df.main[3:5]))    
for (q in Feat){
df_sum = ddply(df.main, ~GroupOfInterest + Subject,
            summarise, q =mean(get(q)))
  }

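I'd like to end up with output like the following (though I realize I now need a separate merge step):

However, depending on how I go about it, I either get an error ("Error in get(q) : invalid first argument") or it averages all values of a feature instead of grouping by Subject and GroupOfInterest.

I have also tried using lists and lapply, but ran into similar difficulties.

From what I've gathered, my problem is that ddply expects Feature1. But when I loop, I supply it with either "Feature1" (a string) or (1,14,14,16,17 ...), which is no longer part of the data frame that needs to be grouped by Subject and GroupOfInterest.

I'd really appreciate any help in solving this and in understanding how the process works.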

Answer 1

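Edited based on comments; as.character(.) needs to be included.

Could you use summarise_at, together with the helper vars(contains(...))?

library(dplyr)

df.main %>% 
    group_by(Subject, GroupOfInterest) %>% 
    # funs() is deprecated in current dplyr; a formula lambda does the same job
    summarise_at(vars(contains("Feature")), ~ mean(as.numeric(as.character(.))))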
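In newer dplyr the same idea is usually written with across(). The following is only a sketch: it assumes dplyr 1.0 or later, and it uses a small made-up data frame with the same column layout as the question's df.main (the values are illustrative, not the asker's data).

```r
library(dplyr)

# Toy data shaped like the question's df.main (values are made up)
df.main <- data.frame(
  Subject = rep(1:2, each = 6),
  GroupOfInterest = rep(letters[1:3], times = 4),
  Feature1 = 1:12,
  Feature2 = 401:412,
  stringsAsFactors = FALSE
)

# Mean of every column whose name contains "Feature", per Subject/GroupOfInterest
df_sum <- df.main %>%
  group_by(Subject, GroupOfInterest) %>%
  summarise(across(contains("Feature"), mean), .groups = "drop")

print(df_sum)
```

Because across() selects the columns by name pattern, no loop is needed and the 40-feature case works without listing the columns.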

Answer 2

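A dplyr solution is given above, but to be fair, here is a data.table one:

library(data.table)

DT <- setDT(df.main)  # converts df.main to a data.table by reference
DT[, lapply(.SD, function(x) mean(as.numeric(as.character(x)))),
   .SDcols = grep("Feature", names(DT), value = TRUE), by = .(Subject, GroupOfInterest)]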

   Subject GroupOfInterest Feature1 Feature2 Feature3
1:       1               a      6.5    459.5      2.0
2:       1               b     11.0    480.5      4.0
3:       1               c      9.5    453.0      4.5
4:       2               a      3.5    483.0      1.5
5:       2               b      8.0    449.0      3.5
6:       2               c     11.5    424.0      1.0

Answer 3

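The OP asked for a simple for-loop to do the data transformation. I know there are many other, more optimized ways to solve this, but to respect the OP's request I tried a for-loop-based solution. I used dplyr because plyr is now outdated.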

library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = T)
Feature2 <- sample(400:500, 12, replace = T)
Feature3 <- sample(1:5, 12, replace = T)
#small change in the way data.frame is created 
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2, 
 Feature3, stringsAsFactors = FALSE)

Feat <- c(colnames(df.main[3:5])) 

# Ready with Key columns on which grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
#  Subject GroupOfInterest
#1       1               a
#2       1               b
#3       1               c
#7       2               a
#8       2               b
#9       2               c


# For loop over each feature column
for (q in Feat) {
  # .data[[q]] looks up the column named by the string in q, and
  # `!!q :=` names the result column after it (this replaces the
  # deprecated summarise_() string interface used originally)
  df_sum <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    summarise(!!q := mean(.data[[q]])) %>%
    ungroup()
  # merge the new mean column into resultdf
  resultdf <- merge(resultdf, df_sum, by = c("Subject", "GroupOfInterest"))
}

# Final result
#> resultdf
#  Subject GroupOfInterest Feature1 Feature2 Feature3
#1       1               a      6.5    473.0      3.5
#2       1               b      4.5    437.0      2.0
#3       1               c     12.0    415.5      3.5
#4       2               a     10.0    437.5      3.0
#5       2               b      3.0    447.0      4.5
#6       2               c      6.0    462.0      2.5
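For readers who want to stay with plyr's ddply, as in the question, the loop can also be made to work by passing ddply a function that indexes the column by its string name instead of going through summarise and get(). This is a minimal sketch, not the asker's code: the toy data, the temporary column name value, and the results list are assumptions made for illustration.

```r
library(plyr)

# Toy data shaped like the question's df.main (values are made up)
df.main <- data.frame(
  Subject = rep(1:2, each = 6),
  GroupOfInterest = rep(letters[1:3], times = 4),
  Feature1 = 1:12,
  Feature2 = 401:412,
  stringsAsFactors = FALSE
)

Feat <- c("Feature1", "Feature2")
results <- list()

for (q in Feat) {
  # ddply hands each group's rows to the function as a data frame d;
  # d[[q]] extracts the column whose name is the string stored in q
  res <- ddply(df.main, c("GroupOfInterest", "Subject"),
               function(d) data.frame(value = mean(d[[q]])))
  # rename the temporary column to the feature name
  names(res)[names(res) == "value"] <- q
  results[[q]] <- res
}

# Combine the per-feature summaries into one data frame
df_sum <- Reduce(function(x, y) merge(x, y, by = c("GroupOfInterest", "Subject")),
                 results)
print(df_sum)
```

Indexing with d[[q]] keeps the lookup inside each group's subset, so the mean is computed per Subject and GroupOfInterest rather than over the whole column.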
