Convert plyr::ddply to dplyr

I have a data frame like this:

tmp <- read.table(header = TRUE, text = "gene_id gene_symbol ensembl_id keep val1 val2 val3
                  x   a   Multiple    Yes 1   2   3
                  x1  a   Multiple    No  2   3   4
                  x2  a   Multiple    No  1   4   3
                  y   b   Multiple    Yes 22  20  12
                  y1  b   Multiple    No  98  7   97
                  y2  b   Multiple    No  8   76  6")

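I am trying to group by the gene_symbol variable, compute the correlation between the row with keep == "Yes" and each of the other rows (keep == "No"), and return the average correlation together with gene_symbol and gene_id. This is the function: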

library(dplyr)

# function to calculate avg. correlation within one gene_symbol group
calc.mean.corr <- function(x){
  # id of the single row flagged keep == "Yes"
  gene.id <- x[which(x$keep == "Yes"),"gene_id"]
  # numeric values of the kept row
  x1 <- x %>%
    filter(keep == "Yes") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep)) %>%
    as.numeric()
  # values of the discarded rows
  x2 <- x %>%
    filter(keep == "No") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep))

  # correlation of kept id with discarded ids
  # (result stored as avg.cor so it does not shadow stats::cor)
  avg.cor <- mean(apply(x2, 1, FUN = function(y) cor(x1, y)))
  avg.cor <- round(avg.cor, digits = 2)
  df <- data.frame(avg.cor = avg.cor, gene_id = gene.id)
  return(df)
}

# call using ddply
for.corr <- plyr::ddply(tmp, .variables = "gene_symbol", .fun = function(x) calc.mean.corr(x))

The final output looks like this:

> for.corr
  gene_symbol avg.cor gene_id
1           a    0.83       x
2           b    0.02       y

I am using plyr::ddply for this, but would like to switch to dplyr. However, I am not sure how to convert this to dplyr. Any help would be much appreciated.

If we don't want to change the function, one option is to split the data with group_split and apply the function to each piece:

library(dplyr)
library(purrr)
tmp %>%
   group_split(gene_symbol) %>%
   map_dfr(calc.mean.corr)

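To include gene_symbol in the output, split with base split() instead, which returns a named list, and pass the names through .id: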

tmp %>%
    split(.$gene_symbol) %>%
    map_dfr(~ calc.mean.corr(.), .id = 'gene_symbol')
#    gene_symbol avg.cor gene_id
#1           a    0.83       x
#2           b    0.02       y
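
Another option that stays within dplyr might be group_modify(). The following is only a sketch, assuming a dplyr version that supports the .keep argument of group_modify(). The function is repeated here in a lightly condensed form (unlist() instead of as.numeric(), gene_id taken as a plain vector) so the block runs on its own; those two tweaks are my assumptions, not part of the original question. .keep = TRUE passes the grouping column to the function, which is needed because select() is asked to drop gene_symbol.

```r
library(dplyr)

tmp <- read.table(header = TRUE, text = "
gene_id gene_symbol ensembl_id keep val1 val2 val3
x       a           Multiple   Yes  1    2    3
x1      a           Multiple   No   2    3    4
x2      a           Multiple   No   1    4    3
y       b           Multiple   Yes  22   20   12
y1      b           Multiple   No   98   7    97
y2      b           Multiple   No   8    76   6")

# Condensed copy of the question's function (assumption: unlist() and a
# plain-vector gene_id are used here for robustness with tibbles).
calc.mean.corr <- function(x) {
  gene.id <- as.character(x$gene_id[x$keep == "Yes"])
  # values of the kept row as a plain vector
  x1 <- x %>%
    filter(keep == "Yes") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep)) %>%
    unlist()
  # values of the discarded rows
  x2 <- x %>%
    filter(keep == "No") %>%
    select(-c(gene_id, gene_symbol, ensembl_id, keep))
  # mean correlation of the kept row with each discarded row
  avg.cor <- mean(apply(x2, 1, function(y) cor(x1, y)))
  data.frame(avg.cor = round(avg.cor, digits = 2), gene_id = gene.id)
}

# .keep = TRUE keeps gene_symbol in each group's data so select() can drop it;
# group_modify() then adds the grouping column back to the result itself.
for.corr <- tmp %>%
  group_by(gene_symbol) %>%
  group_modify(~ calc.mean.corr(.x), .keep = TRUE) %>%
  ungroup()

for.corr
```

This should give the same gene_symbol, avg.cor and gene_id columns as the ddply call, returned as a tibble rather than a plain data frame.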