How to determine whether two sets of variables have a shared value in R?

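I have data containing two groups of variables, and I want to check whether the two groups share a value. Within each row, if any value from the first group equals any value from the second group, the row should be flagged 1, otherwise 0. If the data contain missing values, they should be excluded from the comparison. If the data contain character variables, they should still count as equal whenever their actual value matches the value of a numeric variable.

To illustrate, I generated the data a1. I want to determine whether the first group of variables (z1, x1) and the second group (z2, x2) share a value, and generate the variable result.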

library(data.table)
a1=data.table(z1=c(1,NA,3:5),x1=c("3",4:7),z2=c(2,NA,4:6),x2=c(3,5,4,7,5))
a1$result=c(1,0,0,0,1)

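The actual data has close to 20 million rows, with more variables in each group. I'd like to find the most efficient method. Thank you very much!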

We can loop over the rows, take the length of the intersect between the two groups, and convert it to a logical:

library(data.table)
a1[, result := +(apply(.SD, 1, FUN = function(x) 
     length(intersect(x[1:2], x[3:4]))) > 0)]

Output (note that the printed table contains no NA values, unlike a1 as defined in the question):

> a1
   z1 x1 z2 x2 result
1:  1  3  2  3      1
2:  2  4  3  5      0
3:  3  5  4  4      0
4:  4  6  5  7      0
5:  5  7  6  5      1
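
One thing to watch with the row-wise version: apply() coerces mixed-type input to a character matrix, and intersect() treats NA as an ordinary value, so a row with NA in both groups would count as a match. Below is a minimal sketch of an NA-aware variant, assuming the example a1 from the question and that comparing values by their as.character() form is acceptable. The names m1, m2, and has_shared are illustrative and not part of the answer above.

```r
library(data.table)

a1 <- data.table(z1 = c(1, NA, 3:5), x1 = c("3", 4:7),
                 z2 = c(2, NA, 4:6), x2 = c(3, 5, 4, 7, 5))

# Convert each group to a character matrix explicitly, one column at a time
m1 <- sapply(a1[, .(z1, x1)], as.character)
m2 <- sapply(a1[, .(z2, x2)], as.character)

# For each row, drop NAs on both sides, then test for any common value
has_shared <- vapply(seq_len(nrow(a1)), function(i) {
  v1 <- m1[i, ]
  v2 <- m2[i, ]
  any(v1[!is.na(v1)] %in% v2[!is.na(v2)])
}, logical(1))

a1[, result := as.integer(has_shared)]
```

This is still a loop over rows, so it is unlikely to be fast at 20 million rows; it only illustrates the NA handling.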

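For efficiency, dapply (from collapse) may be faster than apply:

library(collapse)
# compare against 0 so the result is a 0/1 flag, as in the apply() version
a1[, result := +(dapply(.SD, MARGIN = 1, FUN = function(x)
          length(intersect(x[1:2], x[3:4]))) > 0)]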

Or use a vectorized option with str_detect:

library(stringr)
a1[, result := +(str_detect(paste(z1, x1), paste0(z2, "|", x2)))]
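
Because this option matches by regular expression rather than by value, it can report substring hits (3 inside 13) and will treat the pasted text "NA" as a value. The base R sketch below illustrates both pitfalls on single strings, plus one possible mitigation using word boundaries. The inputs are made up for illustration and are not rows from a1.

```r
# Pattern built the same way as paste0(z2, "|", x2) for z2 = 2, x2 = 3
pat <- paste0(2, "|", 3)

# Substring hit: "13 7" contains the character "3" but not the value 3
substring_hit <- grepl(pat, paste(13, 7))

# NA pitfall: paste() turns NA into the text "NA", which then matches itself
na_hit <- grepl(paste0(NA, "|", 5), paste(NA, 4))

# Possible mitigation: wrap the alternatives in word boundaries
bounded <- paste0("\\b(", pat, ")\\b")
bounded_miss <- grepl(bounded, paste(13, 7), perl = TRUE)
bounded_hit  <- grepl(bounded, paste(1, 3), perl = TRUE)
```

Word boundaries do not help with decimals or with NA, so missing values would still need to be filtered out beforehand.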

Here is another general approach, which relies on the column names of each group:

g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)

a1[, result := as.integer(
  apply(.SD, MARGIN = 1, FUN = function(x) any(x[g1] %in% x[g2]))
), .SDcols = c(g1, g2)]
a1
#    z1 x1 z2 x2 result
# 1:  1  3  2  3      1
# 2:  2  4  3  5      0
# 3:  3  5  4  4      0
# 4:  4  6  5  7      0
# 5:  5  7  6  5      1
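
Note that grep("1", ...) selects every name containing a 1 anywhere, so a column such as x12 would land in both groups. A small sketch that anchors the pattern to the end of the name, assuming groups are identified by the last character of the column name; the names below are hypothetical:

```r
nms <- c("z1", "x1", "z2", "x2", "x12", "y21")

# Unanchored patterns: x12 and y21 are picked up by both
loose_g1 <- grep("1", nms, value = TRUE)
loose_g2 <- grep("2", nms, value = TRUE)

# Anchored patterns: each name falls into exactly one group
g1 <- grep("1$", nms, value = TRUE)
g2 <- grep("2$", nms, value = TRUE)
```

With two-column groups named as in the question, both forms select the same columns, so this only matters once the names get longer.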

The actual data is close to 20 million lines, and there are many variables in each group. I want to find the most efficient method

You could reshape to long format and join to see whether there are any matches. I'd expect that to be fairly fast.

# this code should work for the original question (without character vectors or NAs)
# create a row id
a1[, row_id := .I]

# specify column groups
cols1 = c("x1", "z1")
cols2 = c("x2", "z2")

# transform to long form, drop colnames, drop dupes
longDT1 = unique(melt(a1[, c("row_id", ..cols1)], id.vars="row_id")[, !"variable"])
longDT2 = unique(melt(a1[, c("row_id", ..cols2)], id.vars="row_id")[, !"variable"])

# find any matches
w = longDT1[longDT2, on=.(row_id, value), which=TRUE, nomatch=0]

# find associated row_ids
match_row_ids = longDT1[w, unique(row_id)]

# flag rows
a1[, res := FALSE][match_row_ids, res := TRUE]

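Note: if some of your columns mix in character values:

  • You can use type.convert as part of data cleaning, so that you have the correct types from the start.
  • If you must keep strings, the value column in both longDT1 and longDT2 has to be converted to character.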
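
The second bullet can be sketched as follows: convert every column to character before melting, and drop missing values with melt()'s na.rm argument so they never take part in the join. This is a minimal adaptation of the long-format idea, assuming the question's a1; to_long is a hypothetical helper name, not something from data.table.

```r
library(data.table)

a1 <- data.table(z1 = c(1, NA, 3:5), x1 = c("3", 4:7),
                 z2 = c(2, NA, 4:6), x2 = c(3, 5, 4, 7, 5))
a1[, row_id := .I]

cols1 <- c("x1", "z1")
cols2 <- c("x2", "z2")

# Hypothetical helper: one (row_id, value) pair per distinct non-missing value
to_long <- function(dt, cols) {
  tmp <- dt[, lapply(.SD, as.character), .SDcols = cols]  # common type
  tmp[, row_id := dt$row_id]
  long <- melt(tmp, id.vars = "row_id", na.rm = TRUE)     # drop NA values
  unique(long[, .(row_id, value)])
}

long1 <- to_long(a1, cols1)
long2 <- to_long(a1, cols2)

# Inner join on row and value; any surviving row_id has a shared value
hits <- long1[long2, on = .(row_id, value), nomatch = NULL]
a1[, res := as.integer(row_id %in% hits$row_id)]
```

Comparing as character means values must be written identically on both sides, so cleaning with type.convert first may still be the safer route when that is possible.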

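This is definitely a hard problem to scale. After benchmarking (say) as.matrix, apply, asplit, data.table::transpose, and others, I haven't found one that scales reasonably beyond 50K rows.

The most direct path (and, to me, the most palatable in terms of performance) is the literal one: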

a1[, result := +(z1 == z2 | z1 == x2 | x1 == z2 | x1 == x2)]

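However, this fails with NA values, so we need to be more careful. After playing around a bit, I think this helper function is the most direct, since it implements exactly the logic we need and is fully vectorized: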

`%=%` <- function(a, b) !is.na(a) & !is.na(b) & a == b
a1[, +(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)]
# [1] 1 0 0 0 1
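
To see why the plain == version breaks down: a comparison involving NA returns NA, and NA | FALSE is also NA, so a row with missing values and no match ends up NA rather than 0. A small base R illustration with made-up vectors:

```r
`%=%` <- function(a, b) !is.na(a) & !is.na(b) & a == b

a <- c(1, NA, 3)
b <- c(1, NA, 4)

plain  <- a == b     # TRUE NA FALSE
helper <- a %=% b    # TRUE FALSE FALSE

# A number compared with a character value is coerced to character first,
# so "3" and 3 are treated as equal
mixed <- "3" %=% 3   # TRUE
```

The coercion goes through the number's character form, so this only holds when the strings are written the way R would print the numbers; "3.0" would not equal 3, for instance.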

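(I deliberately avoided `%==%` because I've seen other packages where NA %==% NA is true. If you prefer `%==%`, feel free, or use some other infix operator of your choice. It doesn't even need to be infix; that's mostly for aesthetics.)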

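The question is how to automate this when each group has more variables (with groups defined by the trailing digit in the variable name). For that, I suggest building the expression as a string and then eval/parse-ing it.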

g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)
expr <- paste0(
  "+(",
  paste(outer(g1, g2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
  ")")
expr
# [1] "+(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)"

This yields the expected result:

a1[, result2 := eval(parse(text = expr))]
#       z1     x1    z2    x2 result result2
#    <num> <char> <num> <num>  <num>   <int>
# 1:     1      3     2     3      1       1
# 2:    NA      4    NA     5      0       0
# 3:     3      5     4     4      0       0
# 4:     4      6     5     7      0       0
# 5:     5      7     6     5      1       1
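
If you would rather not go through eval(parse()), the same pairwise OR can be assembled with Map() and Reduce(). This is a swapped-in alternative rather than the approach shown above, sketched under the assumption that groups are given by the trailing digit of the column name; pairs, shared, and result3 are illustrative names.

```r
library(data.table)

`%=%` <- function(a, b) !is.na(a) & !is.na(b) & a == b

a1 <- data.table(z1 = c(1, NA, 3:5), x1 = c("3", 4:7),
                 z2 = c(2, NA, 4:6), x2 = c(3, 5, 4, 7, 5))

g1 <- grep("1$", names(a1), value = TRUE)
g2 <- grep("2$", names(a1), value = TRUE)

# All (group 1, group 2) column pairs
pairs <- expand.grid(g1 = g1, g2 = g2, stringsAsFactors = FALSE)

# One logical vector per pair, then OR them together element-wise
shared <- Reduce(`|`, Map(function(a, b) a1[[a]] %=% a1[[b]],
                          pairs$g1, pairs$g2))

a1[, result3 := as.integer(shared)]
```

This variant has not been benchmarked here against the eval(parse()) version, so its relative speed is an open question.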

This scales well vertically. Since a1 has 5 rows, replicating it 1e4 times gives 50K rows, and so on:

a1e4 <- rbindlist(replicate(1e4, a1, simplify=FALSE))    # 50K rows
system.time(a1e4[, result2 := eval(parse(text = expr))])
#    user  system elapsed 
#    0.06    0.00    0.06 

a1e5 <- rbindlist(replicate(1e5, a1, simplify=FALSE))    # 500K
system.time(a1e5[, result2 := eval(parse(text = expr))])
#    user  system elapsed 
#     0.7     0.0     0.7 

a1e6 <- rbindlist(replicate(1e6, a1, simplify=FALSE))    # 5M
system.time(a1e6[, result2 := eval(parse(text = expr))])
#    user  system elapsed 
#    7.16    0.06    7.22 

It appears to scale linearly, which suggests that another 4x the rows (about 20 million) should resolve in roughly 30 seconds.
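
The roughly 30-second estimate is a proportional extrapolation from the timings above. The sketch below just redoes that arithmetic, assuming the linear trend continues up to 20 million rows, which will depend on memory and hardware:

```r
# Elapsed times reported above, with the corresponding row counts
rows    <- c(5e4, 5e5, 5e6)
elapsed <- c(0.06, 0.7, 7.22)

# Seconds per million rows at each size; roughly constant if scaling is linear
per_million <- elapsed / (rows / 1e6)

# Scale the 5M-row timing up to 20M rows
est_20m <- 7.22 * 20e6 / 5e6   # 28.88 seconds
```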

What if each group has more variables (i.e., horizontal scaling)?

set.seed(42)
b1 <- copy(a1[,1:4])[, c("s1","t1","u1","v1","w1","y1", "s2","t2","u2","v2","w2","y2") :=
                         replicate(12, sample(9, .N, replace = TRUE), simplify = FALSE)]
b1
#       z1     x1    z2    x2    s1    t1    u1    v1    w1    y1    s2    t2    u2    v2    w2    y2
#    <num> <char> <num> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1:     1      3     2     3     1     2     9     9     4     8     6     8     1     2     2     1
# 2:    NA      4    NA     5     5     1     5     9     2     6     2     2     5     4     7     1
# 3:     3      5     4     4     1     8     4     4     8     8     5     3     2     3     6     7
# 4:     4      6     5     7     9     7     2     5     3     4     4     8     6     6     8     4
# 5:     5      7     6     5     4     4     3     5     1     4     2     7     6     5     5     9
bg1 = grep("1", names(b1), value = TRUE)
bg2 = grep("2", names(b1), value = TRUE)
bexpr <- paste0(
  "+(",
  paste(outer(bg1, bg2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
  ")")
bexpr
# [1] "+(z1 %=% z2 | x1 %=% z2 | s1 %=% z2 | t1 %=% z2 | u1 %=% z2 | v1 %=% z2 | w1 %=% z2 | y1 %=% z2 | z1 %=% x2 | x1 %=% x2 | s1 %=% x2 | t1 %=% x2 | u1 %=% x2 | v1 %=% x2 | w1 %=% x2 | y1 %=% x2 | z1 %=% s2 | x1 %=% s2 | s1 %=% s2 | t1 %=% s2 | u1 %=% s2 | v1 %=% s2 | w1 %=% s2 | y1 %=% s2 | z1 %=% t2 | x1 %=% t2 | s1 %=% t2 | t1 %=% t2 | u1 %=% t2 | v1 %=% t2 | w1 %=% t2 | y1 %=% t2 | z1 %=% u2 | x1 %=% u2 | s1 %=% u2 | t1 %=% u2 | u1 %=% u2 | v1 %=% u2 | w1 %=% u2 | y1 %=% u2 | z1 %=% v2 | x1 %=% v2 | s1 %=% v2 | t1 %=% v2 | u1 %=% v2 | v1 %=% v2 | w1 %=% v2 | y1 %=% v2 | z1 %=% w2 | x1 %=% w2 | s1 %=% w2 | t1 %=% w2 | u1 %=% w2 | v1 %=% w2 | w1 %=% w2 | y1 %=% w2 | z1 %=% y2 | x1 %=% y2 | s1 %=% y2 | t1 %=% y2 | u1 %=% y2 | v1 %=% y2 | w1 %=% y2 | y1 %=% y2)"

Ugh, that looks awful, but performance with 8 variables per group scales very well:

b1e4 <- rbindlist(replicate(1e4, b1, simplify=FALSE))
system.time(b1e4[, result2 := eval(parse(text = bexpr))])
#    user  system elapsed 
#    0.11    0.00    0.10 

b1e5 <- rbindlist(replicate(1e5, b1, simplify=FALSE))
system.time(b1e5[, result2 := eval(parse(text = bexpr))])
#    user  system elapsed 
#    1.03    0.00    1.03 

b1e6 <- rbindlist(replicate(1e6, b1, simplify=FALSE))
system.time(b1e6[, result2 := eval(parse(text = bexpr))])
#    user  system elapsed 
#   11.72    0.51   12.25