Test two columns of strings for match row-wise in R

Suppose I have two columns of strings:

library(data.table)
DT <- data.table(x = c("a","aa","bb"), y = c("b","a","bbb"))

For each row, I want to know whether the string in x occurs within the string in y on the same row. A looping approach is:

for (i in seq_len(nrow(DT))) {
  DT$test[i] <- DT[i,grepl(x,y) + 0]
}

DT
    x   y test
1:  a   b    0
2: aa   a    0
3: bb bbb    1

Is there a vectorized implementation of this? Calling grep(DT$x, DT$y) only uses the first element of x.
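To make the problem concrete, here is a minimal base R sketch. The variable names first_only and per_row are made up for illustration. It assumes the usual behaviour of grepl, where a pattern argument longer than one element is truncated to its first element (typically with a warning, which is suppressed here).

```r
x <- c("a", "aa", "bb")
y <- c("b", "a", "bbb")

# pattern is not vectorized: only x[1] ("a") is matched against every y
first_only <- suppressWarnings(grepl(x, y))

# what is actually wanted: x[i] matched against y[i] for each row
per_row <- vapply(seq_along(x), function(i) grepl(x[i], y[i]), logical(1))

print(first_only)
print(per_row)
```

The two vectors differ, which is why a row-wise approach is needed.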

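You can pass grepl to apply so that it operates on each row of your data table, where the first column contains the string to search for and the second column contains the string to search in. This replaces the explicit loop, although apply still calls grepl once per row.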

> DT$test <- apply(DT, 1, function(x) as.integer(grepl(x[1], x[2])))
> DT
    x   y test
1:  a   b    0
2: aa   a    0
3: bb bbb    1

You can use Vectorize:

vgrepl <- Vectorize(grepl)
DT[, test := as.integer(vgrepl(x, y))]
DT
    x   y test
1:  a   b    0
2: aa   a    0
3: bb bbb    1

Or use mapply (Vectorize is really just a wrapper around mapply):

DT$test <- mapply(grepl, pattern=DT$x, x=DT$y)
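One thing to be aware of with any of the grepl-based answers: x is treated as a regular expression. If the values in x should be matched literally, a possible variation is to pass fixed = TRUE through MoreArgs. The sketch below uses made-up data (the "a.c" row is added purely to show the difference) and sets USE.NAMES = FALSE so the result is not named after the patterns.

```r
x <- c("a", "aa", "bb", "a.c")
y <- c("b", "a", "bbb", "abc")

# default: x[i] is interpreted as a regex, so "." matches any character
regex_match <- mapply(grepl, pattern = x, x = y, USE.NAMES = FALSE)

# fixed = TRUE: x[i] is matched as a literal substring
fixed_match <- mapply(grepl, pattern = x, x = y,
                      MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(regex_match)
print(fixed_match)
```

Only the last row differs: "a.c" matches "abc" as a regex but not as a literal string.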

You can simply do

DT[, test := grepl(x, y), by = x]
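As far as I understand it, this works because inside a by = x group the grouping column x is a single value, so grepl receives one pattern and all the y values of that group. A small sketch with a repeated x value (the fourth row is invented for illustration) shows that the results still land on the right rows:

```r
library(data.table)

DT <- data.table(x = c("a", "aa", "bb", "a"),
                 y = c("b", "a", "bbb", "cat"))

# one grepl call per distinct x; results are written back row by row
DT[, test := grepl(x, y), by = x]

# number of grepl calls made above
n_calls <- uniqueN(DT$x)

print(DT)
print(n_calls)
```

Here grepl is called three times for four rows, once for each of "a", "aa" and "bb".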

Thanks, everyone, for the responses. I've benchmarked all of them, with the following results:

library(data.table)
library(microbenchmark)

DT <- data.table(x = rep(c("a","aa","bb"),1000), y = rep(c("b","a","bbb"),1000))

DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)

microbenchmark(
DT1[, test := grepl(x, y), by = x]
,
DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2]))
,
DT3$test <- mapply(grepl, pattern=DT3$x, x=DT3$y)
,
{vgrepl <- Vectorize(grepl)
DT4[, test := as.integer(vgrepl(x, y))]}
)

Results

Unit: microseconds
                                                                               expr       min        lq       mean     median        uq        max neval
                                             DT1[, `:=`(test, grepl(x, y)), by = x]   758.339   908.106   982.1417   959.6115  1035.446   1883.872   100
                            DT2$test <- apply(DT, 1, function(x) grepl(x[1], x[2])) 16840.818 18032.683 18994.0858 18723.7410 19578.060  23730.106   100
                              DT3$test <- mapply(grepl, pattern = DT3$x, x = DT3$y) 14339.632 15068.320 16907.0582 15460.6040 15892.040 117110.286   100
 {     vgrepl <- Vectorize(grepl)     DT4[, `:=`(test, as.integer(vgrepl(x, y)))] } 14282.233 15170.003 16247.6799 15544.4205 16306.560  26648.284   100

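Besides having the simplest syntax, the data.table solution is also the fastest here: its median time is about 0.96 ms, versus roughly 15 to 19 ms for the other three. Note that the test data contains only three distinct values of x, so by = x calls grepl just three times per run.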
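Before comparing timings it is worth confirming that the four approaches agree. A rough sanity check on the same benchmark data might look like the following. The by_* names are made up for this sketch, and unname is used because mapply and Vectorize name their results after the patterns by default.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "aa", "bb"), 1000),
                 y = rep(c("b", "a", "bbb"), 1000))

# run each approach on its own copy or on plain vectors
by_group     <- copy(DT)[, test := grepl(x, y), by = x]$test
by_mapply    <- mapply(grepl, pattern = DT$x, x = DT$y, USE.NAMES = FALSE)
by_vectorize <- unname(Vectorize(grepl)(DT$x, DT$y))
by_apply     <- unname(apply(DT, 1, function(row) grepl(row[1], row[2])))

# all four should give the same logical vector
all_agree <- identical(by_group, by_mapply) &&
  identical(by_group, by_vectorize) &&
  identical(by_group, by_apply)
print(all_agree)
```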