R: Faster gregexpr for very large strings

Question

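I am trying to use gregexpr to find the positions of "ABCD" in a large string, and the positions of "ABBD", "ACCD", and "AAAD" in the same string. I want to output the "ABCD" results and the "ABBD, ACCD, AAAD" results as two separate columns of a data table.

My current approach is to run gregexpr separately for each search, export each result as a one-column txt file, import each file as a matrix, sort each one-column matrix so the numbers ascend by row, column-bind the two matrices, and convert the resulting two-column matrix into a data table.

This approach seems very inefficient on a very large string and takes quite a long time to finish. Is there any way to optimize the procedure? Thanks for your help!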

# dummy string that is relatively short for this demo
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# SEARCH for 'ABCD' location
out1 <- gregexpr(pattern = "ABCD", x)
cat(paste(c(out1[[1]]), sep = "\n", collapse = "\n"), file = "~/out_1.txt")    

# SEARCH for 'A??D' location
outB <- gregexpr(pattern = "ABBD", x)
outC <- gregexpr(pattern = "ACCD", x)
outA <- gregexpr(pattern = "AAAD", x)
cat(paste(c(outA[[1]], outB[[1]], outC[[1]]), collapse = "\n"), file = "~/out_2.txt")

# Function that BINDS Matrices by column
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow))
  do.call(cbind, lapply(nm, function (x) rbind(x, matrix(, n-nrow(x), ncol(x)))))
}

# Load as Tables --> Sort by numbers increasing --> Matrices
mat1 <- as.matrix(read.table("~/out_1.txt"))
mat2.t <- (read.table("~/out_2.txt"))
mat2 <- as.matrix(mat2.t[order(mat2.t$V1),])

# Combine two matrices to create 2 column matrix 
comb_mat <- cbind.fill(mat1, mat2)
write.table(comb_mat, file = "~/comb_mat.txt", row.names = FALSE, col.names = FALSE)

Answer 1

No intermediate files are needed.
I would use the fixed=T argument of gregexpr(), which may yield a performance benefit. From https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html:

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).

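You can sort the second column immediately with sort(), rather than storing an intermediate variable and then indexing it with order().
Your cbind.fill() function works, but the NA padding can be done more simply with out-of-bounds indexing, because R returns NA for any index past the end of a vector.

Thus: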

x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD';
out1 <- c(gregexpr('ABCD',x,fixed=T)[[1]]);
out2 <- sort(c(gregexpr('AAAD',x,fixed=T)[[1]],gregexpr('ABBD',x,fixed=T)[[1]],gregexpr('ACCD',x,fixed=T)[[1]]));
outmax <- max(length(out1),length(out2));
comb_mat <- cbind(out1[1:outmax],out2[1:outmax]);
comb_mat;
##      [,1] [,2]
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   NA   21
## [5,]   NA   25

You can then write comb_mat to a file as in your write.table() call.
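To check the quoted claim about matching modes on your own machine, a rough timing sketch such as the following may help. It uses only base R; the repetition count `reps` is an assumed value chosen to keep the run short, and the timings will vary by machine and R version, so treat the numbers as indicative only.

```r
# Build a test string by repeating the 32-character sample.
# `reps` is an assumption for this sketch; raise it to see larger differences.
reps <- 1000
x <- paste(rep("ABCDACCDABBDABCDAAADACCDABBDABCD", reps), collapse = "")

# Time the same literal search under the three matching modes.
t_default <- system.time(p_default <- gregexpr("ABCD", x)[[1]])
t_perl    <- system.time(p_perl    <- gregexpr("ABCD", x, perl = TRUE)[[1]])
t_fixed   <- system.time(p_fixed   <- gregexpr("ABCD", x, fixed = TRUE)[[1]])

# All three modes should report the same start positions
# (as.integer() drops the match.length and related attributes).
same_positions <- identical(as.integer(p_default), as.integer(p_perl)) &&
  identical(as.integer(p_perl), as.integer(p_fixed))

# Elapsed seconds per mode.
print(c(default = t_default[["elapsed"]],
        perl    = t_perl[["elapsed"]],
        fixed   = t_fixed[["elapsed"]]))
```

Because "ABCD" contains no regex metacharacters, fixed = TRUE changes only how the search is performed, not what it finds.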

Edit: As you (and now I) have discovered, gregexpr() performs surprisingly poorly on large strings, and your 237MB string is definitely a large string. Following "Fast partial string matching in R", we can use the stringi package to speed things up. What follows is a demo of how to use stringi::stri_locate_all() to do what you asked. Some notes:

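For my own testing, I created my own 237MB file, which is actually exactly 237,000,001 bytes in size. I used vim to repeat your 32-byte example string 7,406,250 times, for a total of 237,000,000 bytes; the extra byte is the LF that vim appends. I named my test file x, and you can see that I load it with data.table::fread(), because read.table() was taking too long.
I made a small change to my NA-padding algorithm. I realized that, instead of using out-of-bounds indexing, we can assign the maximum length to both vectors' lengths, taking advantage of the right-to-left associativity of the assignment operator. The benefit is that we no longer have to construct the index vector 1:outmax.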

Thus:

library('data.table');
library('stringi');
x <- fread('x',header=F)$V1;
## Read 1 rows and 1 (of 1) columns from 0.221 GB file in 00:00:03
system.time({ out1 <- stri_locate_all(x,regex='ABCD')[[1]][,'start']; });
##    user  system elapsed
##   3.687   0.359   4.044
system.time({ out2 <- stri_locate_all(x,regex='AAAD|ABBD|ACCD')[[1]][,'start']; });
##    user  system elapsed
##   4.938   0.454   5.404
length(out1);
## [1] 22218750
length(out2);
## [1] 37031250
length(out1) <- length(out2) <- max(length(out1),length(out2));
comb_mat <- cbind(out1,out2);
head(comb_mat);
##      out1 out2
## [1,]    1    5
## [2,]   13    9
## [3,]   29   17
## [4,]   33   21
## [5,]   45   25
## [6,]   61   37
tail(comb_mat);
##             out1      out2
## [37031245,]   NA 236999961
## [37031246,]   NA 236999973
## [37031247,]   NA 236999977
## [37031248,]   NA 236999985
## [37031249,]   NA 236999989
## [37031250,]   NA 236999993
nrow(comb_mat);
## [1] 37031250
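
The two NA-padding approaches described in this answer (out-of-bounds indexing and length assignment) can be compared on the short sample string using base R only. This is a minimal sketch; the names `pad1`, `pad2`, `by_index`, and `by_length` are introduced here purely for illustration.

```r
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# Start positions for each search, as plain integer vectors.
out1 <- as.integer(gregexpr("ABCD", x, fixed = TRUE)[[1]])
out2 <- sort(unlist(lapply(
  c("AAAD", "ABBD", "ACCD"),
  function(p) as.integer(gregexpr(p, x, fixed = TRUE)[[1]])
)))
n <- max(length(out1), length(out2))

# Approach 1: indexing past the end of a vector yields NA.
by_index <- cbind(out1[seq_len(n)], out2[seq_len(n)])

# Approach 2: growing a vector via length<- pads it with NA.
pad1 <- out1
pad2 <- out2
length(pad1) <- length(pad2) <- n
by_length <- cbind(pad1, pad2)

# Apart from column names, the two results are the same matrix.
print(identical(unname(by_index), unname(by_length)))
print(by_length)
```

Length assignment avoids building the index vector, which matters more as the vectors grow to tens of millions of elements.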

Answer 2

You can use a lookahead to simplify this, so that you have a single regular expression with two capture groups.

ms <- gregexpr("A(?=(BCD)|(BBD|CCD|AAD))", x, perl=T)
res <- attr(ms[[1]], "capture.start")
res[res>0] <- res[res>0]-1

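In this matrix, res, the first column holds the start positions of ABCD and the second column holds the start positions of the other three patterns. If you like, you can replace the zeros with NA.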

# [1,]  1  0
# [2,]  0  5
# [3,]  0  9
# [4,] 13  0
# [5,]  0 17
# [6,]  0 21
# [7,]  0 25
# [8,] 29  0
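
If you want the compact two-column layout from the question rather than one row per match, the zeros can be turned into NA and each column squeezed and padded. The following is a sketch under that assumption; `cols` and `comb_mat` are illustrative names, and `res <= 0` is used so that any non-positive placeholder is treated as "no match".

```r
x <- "ABCDACCDABBDABCDAAADACCDABBDABCD"

# One pass: the lookahead captures either "BCD" or one of the other three tails.
ms  <- gregexpr("A(?=(BCD)|(BBD|CCD|AAD))", x, perl = TRUE)
res <- attr(ms[[1]], "capture.start")

# Shift each capture start back by one so it points at the leading "A".
res[res > 0] <- res[res > 0] - 1L

# Mark groups that did not take part in a match as NA.
res[res <= 0] <- NA

# Drop the NAs within each column, then pad both columns to a common length.
cols <- lapply(seq_len(ncol(res)), function(j) {
  v <- res[, j]
  v[!is.na(v)]
})
n <- max(lengths(cols))
comb_mat <- sapply(cols, function(v) v[seq_len(n)])
print(comb_mat)
```

Since gregexpr() reports matches in string order, each column is already ascending and needs no separate sort.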

Answer 3

Another approach, using the stringi package:

library(stringi)

x <- 'ABCDACCDABBDABCDAAADACCDABBDABCD'

m <- stri_locate_all_regex(x, c('ABCD', 'AAAD|ABBD|ACCD'))

l <- list(m[[1]][,'start'], m[[2]][,'start'])
do.call(cbind, lapply(l, `[`, seq_len(max(sapply(l, length)))))

#      [,1] [,2]
# [1,]    1    5
# [2,]   13    9
# [3,]   29   17
# [4,]   NA   21
# [5,]   NA   25

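Or you can try the zoo package:

library(zoo)

# cbind() on zoo objects merges by index and fills the shorter series with NA
m <- coredata(do.call(cbind, lapply(l, zoo)))
colnames(m) <- NULL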

#      [,1] [,2]
# [1,]    1    5
# [2,]   13    9
# [3,]   29   17
# [4,]   NA   21
# [5,]   NA   25
