Detecting a pattern in a data.frame column: str_detect is too slow in R

Question

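I have a data.frame with 50,000 rows and 194 columns. One of the columns, named "Gene", holds one or more entries that always follow the same pattern, for example "gene1", "gene1;gene2" or "gene1:gene2:gene3". I also have a very long character vector of regular-expression patterns, for example "\bgene1$|\bgene2$|\bgene3$|\bgene4$...", 4,000 patterns in total, i.e. 4,000 times \bgene$.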

I want to find the matches for these patterns in the Gene column of my data.frame.

Here is an example of the code I currently use. I cannot post the whole data.frame because it is too large.

genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC") # my real number of genes is 3,000

# then I converted the genes' vector to a regexp
genes2 <- paste0("\\b", genes, "\\b")  # "\\b" is a regex word boundary; a single "\b" in an R string is a backspace character

# then I try the matching
# tib is my data.frame and Gene is the column with the values I want to match
matches <- unique(grep(paste(genes2, collapse = "|"),
                       tib$Gene, value = TRUE, perl = FALSE))

# And finally filtering the data.frame
tib2 <- tib %>% dplyr::filter(Gene %in% matches)

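However, with my real data, grep (with perl = FALSE) fails with an out-of-memory error, so I tried the stringr library instead, but it takes too long to complete the search: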

test <- str_extract_all(tib$Gene.refGene, paste(genes2,collapse="|"))
test2 <- str_detect(tib$Gene.refGene, paste(genes2, collapse = "|"))

Both test and test2 are too slow.

Any hints on how to improve this?

A smaller example with fewer rows, provided by @jay.sf, is shown below:

d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94", 
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92", 
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77", 
"AARS1:XY11:XY17:XY62:XY75", "XY25:PEGX:XY47:XY6:XY76:XY84", 
"APC:XY31:XY36:XY48:XY51:XY65", "BRCA1"), x = c(-1.04042150945666, 
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065, 
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073, 
-0.475954635098868)), class = "data.frame", row.names = c(NA, 
-10L))

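The gene list is fixed: genes <- c("AARS", "AARS1", "SAMD11", "MUTYH", "PEGX", "BRCA1", "APC"). I want exact matches between the members of the gene list and the genes in the Gene column, i.e. BRCA1 (in the gene list) should match only BRCA1 and not BRCA11 in the Gene column of the data.frame.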

Keep in mind, though, that my real gene list has 4,000 genes and my data.frame has 50,000 rows.
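To make the exact-match requirement concrete, here is a minimal sketch on made-up values (the vector `gene_col` is hypothetical and the ":" separator is assumed from the example data). It contrasts a plain substring search, which also flags BRCA11, with a comparison of whole gene names after splitting.

```r
# Hypothetical toy values, not taken from the real data
gene_col <- c("BRCA1", "BRCA11", "XY1:BRCA1:XY2", "XY1:BRCA11")
genes <- c("BRCA1", "APC")

# Split each entry into its individual gene names (assumes ":" as separator)
tokens <- strsplit(gene_col, ":", fixed = TRUE)

# An entry matches only if one of its names is identical to a listed gene
exact <- vapply(tokens, function(x) any(x %in% genes), logical(1))

# Plain substring search, for contrast: it also flags BRCA11
loose <- grepl("BRCA1", gene_col, fixed = TRUE)

exact
# [1]  TRUE FALSE  TRUE FALSE
loose
# [1] TRUE TRUE TRUE TRUE
```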

Answer 1

I am not sure about your input and output, but assuming data like this,

d
#                                     gene          x
# 1                                   XY42 -1.0404215
# 2   SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 3                XY17:XY23:XY35:XY36:XY8 -0.2677627
# 4         MUTYH:XY43:XY62:XY85:XY91:XY92  0.7581688
# 5  AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 6                XY2:XY22:XY28:XY69:XY77  0.4681580
# 7              AARS1:XY11:XY17:XY62:XY75  0.1269098
# 8           XY25:XY46:XY47:XY6:XY76:XY84 -0.8524057
# 9          XY22:XY31:XY36:XY48:XY51:XY65 -0.9170595
# 10                                  XY36 -0.4759546

you can use strsplit to split the gene entries at ":" and then, as a first step, match the result against your genes vector.

## all genes from d
d.genes.0 <- sort(unique(unlist(strsplit(d$gene, ":", fixed = TRUE))))
## positions in d.genes.0 of the genes that also occur in the `genes` vector
d.genes.1 <- as.numeric(na.omit(match(genes, d.genes.0)))
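As a small illustration of what the match step returns, here is a sketch on toy vectors (the names `pool` and `wanted` are made up for this example and stand in for d.genes.0 and genes). match gives the position of each wanted name in the pool, or NA when the name is absent, so AARS does not pick up AARS1.

```r
pool   <- c("AARS1", "MUTYH", "XY1", "XY2")  # sorted unique names, like d.genes.0
wanted <- c("AARS", "AARS1", "MUTYH")        # like genes

# Position of each wanted name in pool; NA when it does not occur
pos <- match(wanted, pool)
pos
# [1] NA  1  2

# Drop the NAs and keep plain numbers, as in d.genes.1
as.numeric(na.omit(pos))
# [1] 1 2
```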

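Second, we convert the split genes (as above) to factors, using d.genes.0 as the factor levels. By exploiting the numeric conversion of factors, we end up matching numbers rather than strings.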

## TRUE for every row that contains at least one gene from `genes`
rw <- sapply(strsplit(d$gene, ":", fixed = TRUE), function(x) 
  any(d.genes.1 %in% as.numeric(factor(x, levels = d.genes.0))))
d[rw, ]
#                                    gene          x
# 2  SAMD11:XY20:XY29:XY34:XY82:XY88:XY94 -0.4563033
# 4        MUTYH:XY43:XY62:XY85:XY91:XY92  0.7581688
# 5 AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95 -1.8944023
# 7             AARS1:XY11:XY17:XY62:XY75  0.1269098
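The factor step above can be illustrated in isolation. This is only a sketch on toy values (`levels_all`, `d_small` and the shortened gene list are made up here). It first shows how names turn into integer codes, and then a plainer variant that compares the split names directly with %in%. That variant is not the code from this answer, but on this toy data it selects the same kind of rows.

```r
# Names become their position in the level set; unknown names become NA
levels_all <- c("AARS1", "MUTYH", "XY1", "XY2")
codes <- as.numeric(factor(c("XY2", "AARS1", "ZZZ"), levels = levels_all))
codes
# [1]  4  1 NA

# Plainer variant (an assumption, not the answer's code): compare names directly
d_small <- data.frame(gene = c("XY42", "MUTYH:XY43", "AARS1:XY11", "XY2:XY22"),
                      stringsAsFactors = FALSE)
genes <- c("AARS", "AARS1", "MUTYH")

keep <- vapply(strsplit(d_small$gene, ":", fixed = TRUE),
               function(x) any(x %in% genes), logical(1))
d_small[keep, , drop = FALSE]
#         gene
# 2 MUTYH:XY43
# 3 AARS1:XY11
```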

Tested with more than 4k genes and 50k rows; it should work.
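For anyone who wants to try this at the scale mentioned, here is a hedged sketch that builds synthetic data of roughly that size. Everything in it is an assumption made for illustration: the gene names (`G1`, `G2`, ...), the universe of 20,000 names, and 1 to 5 genes per row. It also uses a flattened variant of the split-and-compare idea rather than the exact code above, and no timing is claimed since that depends on the machine.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n_rows  <- 50000
n_genes <- 4000

# Hypothetical gene universe; the first n_genes names are the genes of interest
universe <- paste0("G", seq_len(20000))
genes    <- universe[seq_len(n_genes)]

# Each row gets 1 to 5 gene names joined by ":"
k <- sample(1:5, n_rows, replace = TRUE)
gene_col <- vapply(k, function(i) paste(sample(universe, i), collapse = ":"),
                   character(1))

elapsed <- system.time({
  tokens <- strsplit(gene_col, ":", fixed = TRUE)
  # Flatten once, test membership once, then map hits back to their rows
  row_id <- rep(seq_along(tokens), lengths(tokens))
  hit    <- unlist(tokens) %in% genes
  keep   <- seq_along(tokens) %in% row_id[hit]
})

elapsed    # timing on this machine
sum(keep)  # number of rows containing at least one gene of interest
```

Because names are compared as whole strings, G1 cannot match G10, which is the exact-match behaviour asked for in the question.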

Data:

d <- structure(list(gene = c("XY42", "SAMD11:XY20:XY29:XY34:XY82:XY88:XY94", 
"XY17:XY23:XY35:XY36:XY8", "MUTYH:XY43:XY62:XY85:XY91:XY92", 
"AARS1:SAMD11:XY100:XY14:XY3:XY51:XY95", "XY2:XY22:XY28:XY69:XY77", 
"AARS1:XY11:XY17:XY62:XY75", "XY25:XY46:XY47:XY6:XY76:XY84", 
"XY22:XY31:XY36:XY48:XY51:XY65", "XY36"), x = c(-1.04042150945666, 
-0.4563032693248, -0.267762662765083, 0.758168827559491, -1.89440229591065, 
0.468157951289336, 0.126909754004865, -0.852405668800981, -0.917059466430073, 
-0.475954635098868)), class = "data.frame", row.names = c(NA, 
-10L))

r

stringr