Detecting a pattern in a data.frame column with str_detect is too slow in R
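I have a data.frame with 50,000 rows and 194 columns. One of the columns, named "Gene", holds one or more entries that always follow the same pattern, e.g. "gene1" or "gene1;gene2" or "gene1:gene2:gene3". I also have a very long character vector of regex patterns, e.g. "\bgene1$|\bgene2$|\bgene3$|\bgene4$...", 4,000 patterns in total, i.e. 4,000 times \bgene$.
I want to find the matches of that pattern in the Gene column of my data.frame.
This is an example of the code I am currently using.
I cannot post the whole data.frame because it is too long.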
library(dplyr)  # provides %>%

genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC") # my real number of genes is 3,000
# then I converted the genes' vector to a regexp
# (the backslash must be doubled inside an R string; "\b" alone is a backspace)
genes2 <- paste0("\\b", genes, "\\b")
# then I try the matching
# tib is my data.frame and Gene the column with the values I want to match
matches <- unique(grep(paste(genes2, collapse = "|"),
                       tib$Gene, value = TRUE, perl = FALSE))
# And finally filtering the data.frame
tib2 <- tib %>% dplyr::filter(Gene %in% matches)
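However, when I run this on the real data, grep (with perl = FALSE) fails with an out-of-memory error, so I tried the stringr library instead, but it takes too long to finish the search:
library(stringr)
test <- str_extract_all(tib$Gene.refGene, paste(genes2, collapse = "|"))
test2 <- str_detect(tib$Gene.refGene, paste(genes2, collapse = "|"))
Both test and test2 are too slow.
Any hints on how to improve this?
An example with fewer rows, provided by @jay.sf, is shown below: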
d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94",
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92",
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77",
"AARS1:XY11:XY17:XY62:XY75", "XY25:PEGX:XY47:XY6:XY76:XY84",
"APC:XY31:XY36:XY48:XY51:XY65", "BRCA1"), x = c(-1.04042150945666,
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065,
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073,
-0.475954635098868)), class = "data.frame", row.names = c(NA,
-10L))
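The gene list is fixed: genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC"). I want to find exact matches between the members of the gene list and the genes in the Gene column, i.e. BRCA1 (in the gene list) should match only BRCA1, not BRCA11, in the Gene column of the data.frame.
Keep in mind that my real gene list has 4,000 genes and my data.frame has 50,000 rows.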
I'm not sure about your input and output. But assuming data like this,
d
# gene x
# 1 XY42 -1.0404215
# 2 SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 3 XY17:XY23:XY35:XY36:XY8 -0.2677627
# 4 MUTYH:XY43:XY62:XY85:XY91:XY92 0.7581688
# 5 AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 6 XY2:XY22:XY28:XY69:XY77 0.4681580
# 7 AARS1:XY11:XY17:XY62:XY75 0.1269098
# 8 XY25:XY46:XY47:XY6:XY76:XY84 -0.8524057
# 9 XY22:XY31:XY36:XY48:XY51:XY65 -0.9170595
# 10 XY36 -0.4759546
you can use strsplit to split the genes at ":" and then, first, match them against your genes vector.
## all genes from d
d.genes.0 <- sort(unique(unlist(strsplit(d$gene, "\\:"))))
## positions in d.genes.0 (as numeric) of the genes that also occur in the `genes` vector
d.genes.1 <- as.numeric(na.omit(match(genes, d.genes.0)))
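Then, second, we convert the split genes (as above) to factors, using d.genes.0 as the factor levels. Because a factor converts to numeric as the position of each value among its levels, we end up matching numbers rather than strings.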
rw <- sapply(strsplit(d$gene, "\\:"), function(x)
any(d.genes.1 %in% as.numeric(factor(x, levels=d.genes.0))))
d[rw, ]
# gene x
# 2 SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 4 MUTYH:XY43:XY62:XY85:XY91:XY92 0.7581688
# 5 AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 7 AARS1:XY11:XY17:XY62:XY75 0.1269098
Tested with more than 4k genes and 50k rows; it should work.
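A rough way to try this at that scale yourself is sketched below. Everything in it is made up for illustration: the symbol names (G1, G2, ...), the size of the symbol universe, and the variable names are assumptions, not the asker's data. It also swaps the per-row sapply for a flattened variant (split once, test membership once, then fold the hits back to one flag per row), which is a different technique from the factor approach above but should select the same rows.

```r
set.seed(1)

# Hypothetical universe of 20,000 made-up gene symbols.
universe  <- paste0("G", seq_len(20000))
genes_big <- sample(universe, 4000)            # the 4,000 genes to look for

# 50,000 rows, each holding 1 to 7 symbols joined by ":".
n_rows   <- 50000
n_sym    <- sample(1:7, n_rows, replace = TRUE)
sym      <- sample(universe, sum(n_sym), replace = TRUE)
gene_col <- vapply(split(sym, rep.int(seq_len(n_rows), n_sym)),
                   paste, character(1), collapse = ":", USE.NAMES = FALSE)
big <- data.frame(gene = gene_col, stringsAsFactors = FALSE)

timing <- system.time({
  toks <- strsplit(big$gene, ":", fixed = TRUE)
  # One membership test over all symbols instead of one per row.
  hit      <- unlist(toks, use.names = FALSE) %in% genes_big
  row_id   <- rep.int(seq_along(toks), lengths(toks))
  # A row is kept if at least one of its symbols was a hit.
  keep_big <- tabulate(row_id[hit], nbins = length(toks)) > 0
})

print(timing)
head(big[keep_big, , drop = FALSE])
```

The timing depends on the machine, so no figure is quoted here; the point is only that no regular expression is built from the gene list at any stage.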
Data:
d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94",
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92",
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77",
"AARS1:XY11:XY17:XY62:XY75", "XY25:XY46:XY47:XY6:XY76:XY84",
"XY22:XY31:XY36:XY48:XY51:XY65", "XY36"), x = c(-1.04042150945666,
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065,
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073,
-0.475954635098868)), class = "data.frame", row.names = c(NA,
-10L))
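As a simpler variant of the same idea, the split symbols can also be compared as plain strings with %in%, which already gives exact matching (BRCA1 does not match BRCA11). The sketch below is only an illustration: d_toy and genes_toy are made-up names and values, and splitting on both ":" and ";" is an assumption based on the separators mentioned in the question.

```r
# Toy data (hypothetical), including the near-miss "BRCA11".
d_toy <- data.frame(
  gene = c("BRCA11", "XY1:BRCA1", "AARS1:XY2", "XY5;AARS"),
  stringsAsFactors = FALSE
)
genes_toy <- c("AARS", "BRCA1")

# Split every row into its individual symbols, at ":" or ";".
tokens <- strsplit(d_toy$gene, "[:;]")

# Keep a row if any of its symbols is exactly equal to a listed gene.
keep <- vapply(tokens, function(x) any(x %in% genes_toy), logical(1))

d_toy[keep, , drop = FALSE]
#        gene
# 2 XY1:BRCA1
# 4  XY5;AARS
```

Rows 1 and 3 are dropped because BRCA11 and AARS1 are different symbols from BRCA1 and AARS, which is the behaviour the question asks for.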