Detect subsethood of rows in `data.table`

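Given two data tables, `a` and `b`, how can I check which rows of `a` are also in `b`? The output should be a logical vector whose length equals the number of rows of `a` and whose order follows the rows of `a`, analogous to `%in%` for vectors.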

For example, here is a simple non-vectorized implementation. Presumably there is a faster way to do this.

library(data.table)

dt.in = function(a, b)
    sapply(1 : nrow(a), function(i)
        nrow(fintersect(a[i], b)) > 0)

stopifnot(identical(
   dt.in(
       data.table(
           c1 = c("c", "1", "c", "F", "p", "c", "r"),
           c2 = c("C", "B", "5", "f", "P", "C", "S")),
       data.table(c1 = letters, c2 = LETTERS)),
   c(T, F, F, F, T, T, F)))

If I understand correctly, this can be achieved by joining on all columns:

library(data.table)
# sample data 
dt1 <- data.table(
  c1 = c("c", "1", "c", "F", "p", "c", "r"),
  c2 = c("C", "B", "5", "f", "P", "C", "S"))
dt2 <- data.table(c1 = letters, c2 = LETTERS)

stopifnot(identical(names(dt1), names(dt2)))
!is.na(dt2[dt1, on = names(dt1), which = TRUE])
[1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE

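The OP has pointed out that the order of the columns matters. For the sake of simplicity, I assume that the column names are the same in both datasets.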

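`which = TRUE` asks the join to return, for each row of `dt1`, the index of the matching row in `dt2`, or `NA` if there is no match. `!is.na()` then converts this into a logical vector, as requested.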
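One caveat worth sketching: if `dt2` contains duplicate rows, the join may return more than one index for a single row of `dt1`, so the result can be longer than `nrow(dt1)`. The helper below is a minimal sketch, not part of the original answer; the name `rows_in` is hypothetical, and it assumes both tables share the same column names. It uses `mult = "first"` to keep one match per row.

```r
library(data.table)

# Hypothetical helper (not from the original answer): a row-wise analogue of %in%.
rows_in <- function(a, b) {
  # Assumption: both tables use the same column names in the same order.
  stopifnot(identical(names(a), names(b)))
  # mult = "first" keeps a single match per row of `a`, so duplicate rows
  # in `b` cannot make the result longer than nrow(a).
  idx <- b[a, on = names(a), which = TRUE, mult = "first"]
  !is.na(idx)
}

a <- data.table(c1 = c("c", "1", "p"), c2 = c("C", "B", "P"))
b <- data.table(c1 = c("c", "c", "p"), c2 = c("C", "C", "P"))  # row c/C appears twice
res <- rows_in(a, b)
print(res)
```

The result has one element per row of `a`, in the order of `a`, regardless of how often a row occurs in `b`.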


If the column names differ and matching is based purely on position, this can be handled programmatically, e.g.

# modified sample data 
dt1 <- data.table(
  c1 = c("c", "1", "c", "F", "p", "c", "r"),
  c2 = c("C", "B", "5", "f", "P", "C", "S"))
dt2 <- data.table(v1 = letters, v2 = LETTERS)

!is.na(dt2[dt1, on = c(paste(names(dt2), names(dt1), sep = "==")), which = TRUE])

Note that the columns of `dt2` are now named `v1` and `v2` instead of `c1` and `c2`.
The join clause (`on =`) has become

"v1==c1" "v2==c2"
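The positional approach can be wrapped in a small function. This is only a sketch under stated assumptions: the name `rows_in_by_position` is hypothetical, both tables are assumed to have the same number of columns with compatible types, and `mult = "first"` is added (it is not in the answer above) so that duplicate rows in the second table still yield one result per row.

```r
library(data.table)

# Hypothetical helper: match rows by column position, ignoring column names.
rows_in_by_position <- function(a, b) {
  # Assumption: columns correspond one-to-one by position.
  stopifnot(ncol(a) == ncol(b))
  # Build "b_col==a_col" pairs, e.g. "v1==c1" "v2==c2".
  on_clause <- paste(names(b), names(a), sep = "==")
  !is.na(b[a, on = on_clause, which = TRUE, mult = "first"])
}

dt1 <- data.table(c1 = c("c", "1", "p"), c2 = c("C", "B", "P"))
dt2 <- data.table(v1 = letters, v2 = LETTERS)
res_pos <- rows_in_by_position(dt1, dt2)
print(res_pos)
```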

A tidyverse-based solution

library(tidyverse)

population <- data.frame(c1=letters, c2=LETTERS)
sample <- data.frame(
            c1 = c("c", "1", "c", "F", "p", "c", "r"),
            c2 = c("C", "B", "5", "f", "P", "C", "S"))

sample %>% 
  left_join(population %>% add_column(InPopulation=TRUE)) %>% 
  pull(InPopulation) %>% 
  replace_na(FALSE)
[1]  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE

This joins on all columns that the data frames have in common. "In common" is determined by name, not by position.
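Because matching is by name, differently named columns need an explicit `by` mapping. The following is a sketch rather than part of the answer above: the column names `v1` and `v2` and the variable names are assumptions for illustration, `distinct()` is added so that duplicate rows in the population cannot duplicate rows of the sample, and `coalesce()` from dplyr stands in for `replace_na()`.

```r
library(dplyr)

# Assumed example data: the population uses different column names (v1, v2).
population <- data.frame(v1 = letters, v2 = LETTERS)
sample_rows <- data.frame(c1 = c("c", "1", "p"), c2 = c("C", "B", "P"))

in_population <- sample_rows %>%
  left_join(
    # distinct() guards against duplicate population rows multiplying matches.
    population %>% distinct() %>% mutate(InPopulation = TRUE),
    # Explicit mapping: sample column on the left, population column on the right.
    by = c("c1" = "v1", "c2" = "v2")
  ) %>%
  pull(InPopulation) %>%
  coalesce(FALSE)  # unmatched rows have NA here; turn them into FALSE

print(in_population)
```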