Merge on columns with multiple values
I have a data frame cluster, one column of which, cluster$Genes, looks like this:
ENSG00000134684
ENSG00000188846, ENSG00000181163, ENSG00000114391
ENSG00000134684, ENSG00000175390
ENSG00000134684
ENSG00000134684, ENSG00000175390
...
The number of elements per row in this column is arbitrary. I also have another data frame, expression, that looks like this:
ENSGID a b
ENSG00000134684 1 3
ENSG00000175390 2 0
ENSG00000000419 131.23 108.73
ENSG00000000457 7.11 8.68
ENSG00000000460 15.70 6.59
ENSG00000000938 0 0
ENSG00000000971 0.03 0.07
ENSG00000001036 59.22 58.3
...
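... with roughly 20,000 rows. What I would like to do is:
- For all the elements in each row of cluster$Genes, look up the corresponding a and b values
- For each row of cluster$Genes, calculate the minimum, maximum and mean of a and of b separately
- Create six new columns in the cluster data frame and populate them with the (min.a, max.a, mean.a, min.b, max.b, mean.b) values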
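I have tried to find a way to do this, without much success. While googling for help I figured I might use some kind of apply, and I came up with some code. I think it is mostly gibberish and it does not work at all, so I am a bit stuck. Here is what I have: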
exp.lookup = function(genes) {
  genes.split = strsplit(genes, ', ')
  exp.hct = list()
  exp.hke = list()
  for ( gene in genes.split ) {
    exp.hct = c(exp.hct, merge(gene, means$hct, all.x=TRUE))
    exp.hke = c(exp.hke, merge(gene, means$hke, all.x=TRUE))
    return(c(exp.hct, exp.hke))
  }
}
apply(cluster['Genes'], 1, FUN=exp.lookup)
Does anyone have a better idea, one that might actually work?
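Assuming each ENSGID corresponds to a unique pair of a and b values, I would suggest:
Assign cluster$Genes to a variable (in other words, copy it outside the cluster data frame), for example new_cluster_genes <- cluster$Genes.
Manipulate new_cluster_genes so that each row holds a single ENSGID, and add a column header named ENSGID.
Merge new_cluster_genes with the expression data frame, using ENSGID as the common ID. Assign the resulting data frame to a variable, for example merged_genes.
Calculate the minimum, maximum and mean of a and of b (separately) for each row: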
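library(dplyr)
# Note: without a preceding group_by(), min(), max() and mean() here are
# computed over all rows of merged_genes, not per original cluster row.
merged_genes %>%
  mutate(min.a = min(a),
         max.a = max(a),
         mean.a = mean(a),
         min.b = min(b),
         max.b = max(b),
         mean.b = mean(b)) -> merged_genes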
Create the 6 new columns and populate them with the (min.a, max.a, mean.a, min.b, max.b, mean.b) values:
merged_genes %>% select(ENSGID, min.a:mean.b) -> merged_genes_subset
Manipulate the cluster data frame so that each row holds a single ENSGID, and add a column header named ENSGID. Then merge merged_genes_subset with cluster, using ENSGID as the common ID.
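The steps above can be sketched end to end as follows. This is only a minimal illustration under a few assumptions that are not in the original answer: it uses tidyr::separate_rows() to split the comma-separated IDs, it introduces a hypothetical helper column row_id to remember which cluster row each gene came from, and it groups by that column so the statistics are computed per original row. The toy data are the rows shown in the question.

```r
library(dplyr)
library(tidyr)

# Toy data copied from the rows shown in the question.
cluster <- data.frame(
  Genes = c("ENSG00000134684",
            "ENSG00000188846, ENSG00000181163, ENSG00000114391",
            "ENSG00000134684, ENSG00000175390",
            "ENSG00000134684",
            "ENSG00000134684, ENSG00000175390"),
  stringsAsFactors = FALSE
)
expression <- data.frame(
  ENSGID = c("ENSG00000134684", "ENSG00000175390", "ENSG00000000419"),
  a = c(1, 2, 131.23),
  b = c(3, 0, 108.73),
  stringsAsFactors = FALSE
)

# row_id is a hypothetical helper column (not in the original answer)
# that records which cluster row each gene ID belongs to.
cluster$row_id <- seq_len(nrow(cluster))

stats <- cluster %>%
  separate_rows(Genes, sep = ", ") %>%       # one gene ID per row
  rename(ENSGID = Genes) %>%
  inner_join(expression, by = "ENSGID") %>%  # IDs without expression data are dropped
  group_by(row_id) %>%                       # statistics per original cluster row
  summarise(min.a = min(a), max.a = max(a), mean.a = mean(a),
            min.b = min(b), max.b = max(b), mean.b = mean(b))

# Left join back so cluster rows with no matching gene keep NA in the new columns.
cluster_stats <- left_join(cluster, stats, by = "row_id")
```

Grouping by row_id rather than by ENSGID is the design choice that matters here: the question asks for one set of statistics per cluster row, and the same gene can appear in several rows.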
Recreating the initial data:
library(data.table)
cluster <- as.data.table(list(Genes = c("ENSG00000134684",
                                        "ENSG00000188846, ENSG00000181163, ENSG00000114391",
                                        "ENSG00000134684, ENSG00000175390",
                                        "ENSG00000134684",
                                        "ENSG00000134684, ENSG00000175390")))
expression <- as.data.table(list(ENSGID = c("ENSG00000134684", "ENSG00000175390",
                                            "ENSG00000000419", "ENSG00000000457",
                                            "ENSG00000000460", "ENSG00000000938",
                                            "ENSG00000000971", "ENSG00000001036"),
                                 a = c(1, 2, 131.23, 7.11, 15.70, 0, 0.03, 59.22),
                                 b = c(3, 0, 108.73, 8.68, 6.59, 0, 0.07, 58.3)))
setkey(cluster, Genes)
setkey(expression, ENSGID)
Solution:
library(data.table)
result <- function() {
  colnames <- c("min.a", "max.a", "mean.a", "min.b", "max.b", "mean.b")
  # 1. "(colnames)" is parenthesized to ensure that the new columns are named
  #    after the contents of the colnames variable, i.e. it evaluates to a
  #    character vector of new column names
  # 2. ":=" adds new columns to an existing data.table by reference
  # 3. "count(Genes)" calls count() on the "Genes" column; because we group
  #    with "by = Genes", count() processes each group of identical Genes
  #    values in turn, and each value is a character vector.
  cluster[, (colnames) := count(Genes), by = Genes]
}
# get Genes row
count <- function(charvector) {
  ENSGIDc <- strsplit(charvector, ", ")
  # 4. subset the rows of the "expression" data.table by the split "Genes"
  #    character vector, named "ENSGIDc"...
  # 5. ... and then calculate the minimum, maximum and mean of each column
  expression[ENSGIDc, .(min(a, na.rm = TRUE), max(a, na.rm = TRUE),
                        mean(a, na.rm = TRUE), min(b, na.rm = TRUE),
                        max(b, na.rm = TRUE), mean(b, na.rm = TRUE))]
  # 6. at this point the resulting 1-row, 6-column data.table is returned
  #    to the calling function, where it is added to the "cluster" data.table
}
suppressWarnings(result())
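For comparison, here is a join-based data.table sketch of the same idea. It is a variant, not the code from the answer above: it reshapes cluster into long form first and aggregates once, so no per-group lookup function and no suppressWarnings() are needed. The helper column row_id and the intermediate names long, joined, stats and cluster_stats are hypothetical names chosen for illustration, and rows whose genes have no match in expression get NA rather than Inf or NaN.

```r
library(data.table)

# Toy data copied from the rows shown in the question.
cluster <- data.table(Genes = c("ENSG00000134684",
                                "ENSG00000188846, ENSG00000181163, ENSG00000114391",
                                "ENSG00000134684, ENSG00000175390",
                                "ENSG00000134684",
                                "ENSG00000134684, ENSG00000175390"))
expression <- data.table(ENSGID = c("ENSG00000134684", "ENSG00000175390",
                                    "ENSG00000000419", "ENSG00000000457",
                                    "ENSG00000000460", "ENSG00000000938",
                                    "ENSG00000000971", "ENSG00000001036"),
                         a = c(1, 2, 131.23, 7.11, 15.70, 0, 0.03, 59.22),
                         b = c(3, 0, 108.73, 8.68, 6.59, 0, 0.07, 58.3))

# row_id is a hypothetical helper column identifying each original row.
cluster[, row_id := .I]

# Long form: one row per (row_id, gene ID) pair.
long <- cluster[, .(ENSGID = unlist(strsplit(Genes, ", ", fixed = TRUE))),
                by = row_id]

# Inner join on ENSGID; gene IDs missing from expression are dropped.
joined <- merge(long, expression, by = "ENSGID")

# One row of summary statistics per original cluster row.
stats <- joined[, .(min.a = min(a), max.a = max(a), mean.a = mean(a),
                    min.b = min(b), max.b = max(b), mean.b = mean(b)),
                by = row_id]

# Left join back so rows without any matching gene keep NA.
cluster_stats <- merge(cluster, stats, by = "row_id", all.x = TRUE)
```

Because the statistics are keyed on row_id rather than on the Genes string, duplicated Genes values (rows 1 and 4 here) are handled as separate rows without relying on grouping by identical strings.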