Aggregating data from multiple columns instead of a single column

I have a huge gene expression dataset with 200k variables (rows) and 170 observations (columns). Here are the first few rows and columns:

    Gene    Transcript_ID   V1  V2  V3  V4  V5
1   ENSG00000000003.14  ENST00000612152.4   0   6   0   3   15
2   ENSG00000000003.14  ENST00000373020.8   4   0   5   0   0
3   ENSG00000000003.14  ENST00000614008.4   0   0   0   0   0
4   ENSG00000000003.14  ENST00000496771.5   0   3   0   0   7

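I am trying to group all of the data so that expression is summed per gene. I am starting from existing syntax that groups a single data column by some metadata (the gene ID), and I want to run it across all 170 observation columns. The syntax is below; this should be a very simple fix.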

transcript_grouped <-aggregate(res$V1, by=list(Category=res$Gene), FUN=sum)

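V1 is the name of one observation/data column, res is the whole dataset, and Gene is the category I want the data grouped by. This syntax works for V1, but I need to run it on all columns.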

I have tried creating a variable that holds all of the column names, even pasting them in by hand.

dataColumns <- c("V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","V29","V30","V31","V32","V33","V34","V35","V36","V37","V38","V39","V40","V41","V42","V43","V44","V45","V46","V47","V48","V49","V50","V51","V52","V53","V54","V55","V56","V57","V58","V59","V60","V61","V62","V63","V64","V65","V66","V67","V68","V69","V70","V71","V72","V73","V74","V75","V76","V77","V78","V79","V80","V81","V82","V83","V84","V85","V86","V87","V88","V89","V90","V91","V92","V93","V94","V95","V96","V97","V98","V99","V100","V101","V102","V103","V104","V105","V106","V107","V108","V109","V110","V111","V112","V113","V114","V115","V116","V117","V118","V119","V120","V121","V122","V123","V124","V125","V126","V127","V128","V129","V130","V131","V132","V133","V134","V135","V136","V137","V138","V139","V140","V141","V142","V143","V144","V145","V146","V147","V148","V149","V150","V151","V152","V153","V154","V155","V156","V157","V158","V159","V160","V161","V162","V163","V164","V165","V166")

trans_grouped <-aggregate(res$dataColumns, by=list(Category=res$Gene), FUN=sum)

Error in aggregate.data.frame(as.data.frame(x), ...) : no rows to aggregate

How can I loop over the data so that all columns are included?

How about this dplyr solution:

library(dplyr)
df %>%
  group_by(Gene) %>%
  summarise(across(starts_with("V"), ~sum(.)))
# A tibble: 2 x 4
  Gene     V1    V2    V3
* <chr> <dbl> <dbl> <dbl>
1 A         4     4     7
2 B         6     4     3

Test data:

df <- data.frame(
  Gene = c("A", "B", "A", "B"),
  V1 = c(1,2,3,4),
  V2 = c(2,2,2,2),
  V3 = c(4,2,3,1)
)

Using aggregate: it works if we drop the second column (Transcript_ID), so that only Gene and the numeric columns remain:

aggregate(. ~ Gene, df[-2], FUN=sum)

Output:

                Gene V1 V2 V3 V4 V5
1 ENSG00000000003.14  4  9  5  3 22
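
As a side note on why the attempt in the question fails: `res$dataColumns` looks for a column literally named "dataColumns", finds none, and returns NULL, which is what leads to "no rows to aggregate". Indexing with the character vector, `res[dataColumns]`, selects the columns it names. Below is a minimal sketch of the original `by=` form with that change; the tiny `res` table and its `g1`/`t1` labels are made up for illustration, and building `dataColumns` with `grep()` is just one assumed way to avoid typing the names out.

```r
# Toy stand-in for the real table (values and labels are invented)
res <- data.frame(
  Gene          = c("g1", "g1", "g2"),
  Transcript_ID = c("t1", "t2", "t3"),
  V1            = c(0, 4, 1),
  V2            = c(6, 0, 2)
)

# Collect every column named V<number> instead of listing them by hand
dataColumns <- grep("^V[0-9]+$", names(res), value = TRUE)

# `$` does not expand the variable: there is no column called "dataColumns"
is.null(res$dataColumns)

# `[` with a character vector selects the named columns as a data frame
trans_grouped <- aggregate(res[dataColumns],
                           by  = list(Gene = res$Gene),
                           FUN = sum)
print(trans_grouped)
```

The result has one row per gene and one summed column per entry in `dataColumns`.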

We can also use summarise together with across from the dplyr package. Credit to Chris Ruehlemann, whose answer came in 3 minutes earlier!

df %>% 
  group_by(Gene) %>% 
  summarise(across(starts_with('V'), sum))

Output:

 Gene                  V1    V2    V3    V4    V5
  <chr>              <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003.14     4     9     5     3    22
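
Another base R option, not used in the answers above, is `rowsum()`, which sums the rows of a matrix or data frame within each level of a grouping vector. It may be worth trying on a table with this many rows, though it has not been timed here. The sketch below uses an invented toy table; `value_cols` and `gene_sums` are hypothetical names.

```r
# Toy stand-in for the real table (values and labels are invented)
res <- data.frame(
  Gene          = c("g1", "g1", "g2"),
  Transcript_ID = c("t1", "t2", "t3"),
  V1            = c(0, 4, 1),
  V2            = c(6, 0, 2)
)

# Everything except the two identifier columns is treated as data
value_cols <- setdiff(names(res), c("Gene", "Transcript_ID"))

# One row per gene; the gene IDs end up as row names
sums <- rowsum(res[value_cols], group = res$Gene)

# Move the row names into an ordinary Gene column
gene_sums <- cbind(Gene = rownames(sums), sums)
rownames(gene_sums) <- NULL
print(gene_sums)
```

Unlike `aggregate()` and `summarise()`, `rowsum()` only computes sums, so it fits this question but not other summary functions.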

Data:

df <- structure(list(Gene = c("ENSG00000000003.14", "ENSG00000000003.14", 
"ENSG00000000003.14", "ENSG00000000003.14"), Transcript_ID = c("ENST00000612152.4", 
"ENST00000373020.8", "ENST00000614008.4", "ENST00000496771.5"
), V1 = c(0, 4, 0, 0), V2 = c(6, 0, 0, 3), V3 = c(0, 5, 0, 0), 
V4 = c(3, 0, 0, 0), V5 = c(15, 0, 0, 7)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Gene = structure(list(), class = c("collector_character", 
"collector")), Transcript_ID = structure(list(), class = c("collector_character", 
"collector")), V1 = structure(list(), class = c("collector_double", 
"collector")), V2 = structure(list(), class = c("collector_double", 
"collector")), V3 = structure(list(), class = c("collector_double", 
"collector")), V4 = structure(list(), class = c("collector_double", 
"collector")), V5 = structure(list(), class = c("collector_double", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"))