How to determine whether two sets of variables have a shared value in R?
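I have a dataset with two groups of variables, and I want to check whether the two groups share a value. For each row, if any pair of values across the two groups is equal, the row is flagged as 1, otherwise 0. If the data contains missing values, they should not take part in the comparison. If the data contains character variables, they should still count as equal as long as their actual value matches the value of a numeric variable.
To illustrate, I generate the data a1. I want to determine whether the first group of variables (z1 and x1) and the second group (z2 and x2) share a value, and generate the variable result.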
library(data.table)
a1 = data.table(z1 = c(1, NA, 3:5), x1 = c("3", 4:7), z2 = c(2, NA, 4:6), x2 = c(3, 5, 4, 7, 5))
a1$result = c(1, 0, 0, 0, 1)
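The actual data has close to 20 million rows, with more variables in each group. I want to find the most efficient method. Thank you very much!
We can loop over the rows, take the length of the intersect between the two groups, and convert it to logical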
library(data.table)
a1[, result := +(apply(.SD, 1, FUN = function(x)
length(intersect(x[1:2], x[3:4]))) > 0)]
Output:
> a1
z1 x1 z2 x2 result
1: 1 3 2 3 1
2: 2 4 3 5 0
3: 3 5 4 4 0
4: 4 6 5 7 0
5: 5 7 6 5 1
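One thing worth checking before using this on data with missing values: intersect treats NA as an ordinary value, so a row with NA in both groups counts as a match (the output above appears to come from a version of the data without NA). A minimal base-R sketch of that behaviour, using made-up vectors x and y that stand in for one row:

```r
x <- c(NA, "4")   # first group of a row, after apply() has coerced it to character
y <- c(NA, "5")   # second group of the same row

with_na    <- length(intersect(x, y)) > 0              # TRUE: the two NAs "match"
without_na <- length(intersect(x[!is.na(x)], y)) > 0   # FALSE: NA no longer takes part
```

Dropping NA from one side before intersecting is enough, since an NA left on the other side then has nothing to match.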
As for efficiency, dapply (from collapse) may be faster than apply
library(collapse)
a1[, result := dapply(.SD, MARGIN = 1, FUN = function(x)
length(intersect(x[1:2], x[3:4])))]
Or use a vectorized option with str_detect
library(stringr)
a1[, result := +(str_detect(paste(z1, x1), paste0(z2, "|", x2)))]
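Since the second argument is used as a regular expression, this matches substrings rather than whole values, and NA gets pasted in as the text "NA". A small sketch of both effects, using base grepl as a stand-in for str_detect on a single row (the values are made up for illustration):

```r
# Base-R analogue of the str_detect() call for one row
# (grepl is used only to keep the sketch package-free)
na_hit  <- grepl(paste0(NA, "|", 5), paste(NA, "4"))  # pattern "NA|5" vs "NA 4": TRUE
sub_hit <- grepl(paste0(1, "|", 9), paste(12, "7"))   # pattern "1|9"  vs "12 7": TRUE
```

So this option looks safe only when there are no missing values and no value is a substring of another.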
Here is another general approach, which relies on the column names of each group:
g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)
a1[, result := as.integer(
apply(.SD, MARGIN = 1, FUN = function(x) any(x[g1] %in% x[g2]))
), .SDcols = c(g1, g2)]
a1
# z1 x1 z2 x2 result
# 1: 1 3 2 3 1
# 2: 2 4 3 5 0
# 3: 3 5 4 4 0
# 4: 4 6 5 7 0
# 5: 5 7 6 5 1
The actual data is close to 20 million lines, and there are many variables in each group. I want to find the most efficient method
You can melt to long format and join to see whether there are any matches. I expect this to be fairly fast.
# this code should work for the original question (without character vectors or NAs)
# create a row id
a1[, row_id := .I]
# specify column groups
cols1 = c("x1", "z1")
cols2 = c("x2", "z2")
# transform to long form, drop colnames, drop dupes
longDT1 = unique(melt(a1[, c("row_id", ..cols1)], id.vars="row_id")[, !"variable"])
longDT2 = unique(melt(a1[, c("row_id", ..cols2)], id.vars="row_id")[, !"variable"])
# find any matches
w = longDT1[longDT2, on=.(row_id, value), which=TRUE, nomatch=0]
# find associated row_ids
match_row_ids = longDT1[w, unique(row_id)]
# flag rows
a1[, res := FALSE][match_row_ids, res := TRUE]
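Note: if you have character values mixed into some columns:
- You can use type.convert as part of your data cleaning, so that the types are right from the start.
- If you must have strings, the value column in both longDT1 and longDT2 must be converted to character.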
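As a minimal sketch of the first option, here is type.convert on a standalone vector holding the same values as x1 (the names x1_num and same_as_x2 are only for illustration):

```r
x1 <- c("3", "4", "5", "6", "7")            # same values as the x1 column
x1_num <- type.convert(x1, as.is = TRUE)    # character -> integer
x1_class <- class(x1_num)                   # "integer"

# after conversion the column compares numerically with x2
same_as_x2 <- x1_num == c(3, 5, 4, 7, 5)    # TRUE FALSE FALSE FALSE FALSE
```

In a data.table the same call can be applied to the column by reference, for example a1[, x1 := type.convert(x1, as.is = TRUE)].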
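This is definitely a problem that is hard to scale. After benchmarking (for instance) as.matrix, apply, asplit, data.table::transpose and others, I have not found a row-wise approach that scales reasonably beyond 50K rows.
The most direct path (and, to me, the most palatable performance-wise) is the literal one: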
a1[, result := +(z1 == z2 | z1 == x2 | x1 == z2 | x1 == x2)]
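However, this fails with NA values, so we need to be more careful. After playing with it a bit, I think this helper function is the most direct, since it does exactly the logic we need and is fully vectorized: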
`%=%` <- function(a, b) !is.na(a) & !is.na(b) & a == b
a1[, +(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)]
# [1] 1 0 0 0 1
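(I am intentionally avoiding `%==%`, since I have seen other packages where NA %==% NA is true. If you prefer `%==%`, feel free, or use some other infix operator of your choice. It does not even need to be infix; that is mostly for aesthetics.)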
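To see what the helper changes, here is a small base-R sketch comparing plain == with `%=%` on made-up vectors, including the character-versus-numeric case from the question:

```r
`%=%` <- function(a, b) !is.na(a) & !is.na(b) & a == b

plain <- c(1, NA, 3) == c(1, NA, 4)     # TRUE NA FALSE: the NA propagates
safe  <- c(1, NA, 3) %=% c(1, NA, 4)    # TRUE FALSE FALSE: NA never matches

# comparing character with numeric coerces the number to character
mixed <- c("3", "4") %=% c(3, 5)        # TRUE FALSE
# caveat of that coercion: the comparison is on text, so "3.0" is not 3
text_cmp <- "3.0" %=% 3                 # FALSE
```

The last line is why cleaning up the types first is still worthwhile if the strings are not formatted consistently.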
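The question is how to automate this when there are more variables in each group (the groups being defined by the trailing digit in the variable names). For that, I suggest building the expression programmatically and then running eval/parse on it.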
g1 = grep("1", names(a1), value = TRUE)
g2 = grep("2", names(a1), value = TRUE)
expr <- paste0(
"+(",
paste(outer(g1, g2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
")")
expr
# [1] "+(z1 %=% z2 | x1 %=% z2 | z1 %=% x2 | x1 %=% x2)"
This produces the expected result:
a1[, result2 := eval(parse(text = expr))]
# z1 x1 z2 x2 result result2
# <num> <char> <num> <num> <num> <int>
# 1: 1 3 2 3 1 1
# 2: NA 4 NA 5 0 0
# 3: 3 5 4 4 0 0
# 4: 4 6 5 7 0 0
# 5: 5 7 6 5 1 1
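If you would rather not build code as text, the same pairwise logic can be accumulated in a plain loop over column pairs. This is only a sketch: any_shared is a hypothetical helper name, not something from the answer above, and I have not benchmarked it against the eval/parse version.

```r
# Hypothetical helper: OR together the NA-safe comparison of every
# column pair, one vectorized comparison per pair.
any_shared <- function(dt, g1, g2) {
  hit <- logical(nrow(dt))              # starts as all FALSE
  for (a in g1) {
    for (b in g2) {
      va <- dt[[a]]
      vb <- dt[[b]]
      hit <- hit | (!is.na(va) & !is.na(vb) & va == vb)
    }
  }
  as.integer(hit)
}

# Same toy values as a1, as a plain data.frame so the sketch needs no packages
d <- data.frame(z1 = c(1, NA, 3:5), x1 = c("3", 4:7),
                z2 = c(2, NA, 4:6), x2 = c(3, 5, 4, 7, 5),
                stringsAsFactors = FALSE)
res <- any_shared(d, c("z1", "x1"), c("z2", "x2"))
# res is 1 0 0 0 1
```

Because it only uses `[[`, the function should work unchanged on a data.table, e.g. a1[, result3 := any_shared(.SD, g1, g2)].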
This scales well vertically. Since a1 is 5 rows, replicating it 1e4 times yields 50K rows, and so on.
a1e4 <- rbindlist(replicate(1e4, a1, simplify=FALSE)) # 50K rows
system.time(a1e4[, result2 := eval(parse(text = expr))])
# user system elapsed
# 0.06 0.00 0.06
a1e5 <- rbindlist(replicate(1e5, a1, simplify=FALSE)) # 500K
system.time(a1e5[, result2 := eval(parse(text = expr))])
# user system elapsed
# 0.7 0.0 0.7
a1e6 <- rbindlist(replicate(1e6, a1, simplify=FALSE)) # 5M
system.time(a1e6[, result2 := eval(parse(text = expr))])
# user system elapsed
# 7.16 0.06 7.22
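It appears to scale linearly, which means that another 4x the rows (20 million) should resolve in about 30 seconds.
What if there are more variables per group? (That is, scaling horizontally.)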
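The arithmetic behind that estimate, using the elapsed times reported above and assuming the linear trend continues (timings will of course differ by machine):

```r
rows    <- c(5e4, 5e5, 5e6)      # rows in a1e4, a1e5, a1e6
elapsed <- c(0.06, 0.7, 7.22)    # elapsed seconds reported above

per_row <- elapsed[3] / rows[3]  # seconds per row at 5M rows
est_20m <- per_row * 20e6        # about 28.9 seconds for 20M rows
```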
set.seed(42)
b1 <- copy(a1[,1:4])[, c("s1","t1","u1","v1","w1","y1", "s2","t2","u2","v2","w2","y2") :=
replicate(12, sample(9, .N, replace = TRUE), simplify = FALSE)]
b1
# z1 x1 z2 x2 s1 t1 u1 v1 w1 y1 s2 t2 u2 v2 w2 y2
# <num> <char> <num> <num> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1: 1 3 2 3 1 2 9 9 4 8 6 8 1 2 2 1
# 2: NA 4 NA 5 5 1 5 9 2 6 2 2 5 4 7 1
# 3: 3 5 4 4 1 8 4 4 8 8 5 3 2 3 6 7
# 4: 4 6 5 7 9 7 2 5 3 4 4 8 6 6 8 4
# 5: 5 7 6 5 4 4 3 5 1 4 2 7 6 5 5 9
bg1 = grep("1", names(b1), value = TRUE)
bg2 = grep("2", names(b1), value = TRUE)
bexpr <- paste0(
"+(",
paste(outer(bg1, bg2, function(a, b) sprintf("%s %%=%% %s", a, b)), collapse = " | "),
")")
bexpr
# [1] "+(z1 %=% z2 | x1 %=% z2 | s1 %=% z2 | t1 %=% z2 | u1 %=% z2 | v1 %=% z2 | w1 %=% z2 | y1 %=% z2 | z1 %=% x2 | x1 %=% x2 | s1 %=% x2 | t1 %=% x2 | u1 %=% x2 | v1 %=% x2 | w1 %=% x2 | y1 %=% x2 | z1 %=% s2 | x1 %=% s2 | s1 %=% s2 | t1 %=% s2 | u1 %=% s2 | v1 %=% s2 | w1 %=% s2 | y1 %=% s2 | z1 %=% t2 | x1 %=% t2 | s1 %=% t2 | t1 %=% t2 | u1 %=% t2 | v1 %=% t2 | w1 %=% t2 | y1 %=% t2 | z1 %=% u2 | x1 %=% u2 | s1 %=% u2 | t1 %=% u2 | u1 %=% u2 | v1 %=% u2 | w1 %=% u2 | y1 %=% u2 | z1 %=% v2 | x1 %=% v2 | s1 %=% v2 | t1 %=% v2 | u1 %=% v2 | v1 %=% v2 | w1 %=% v2 | y1 %=% v2 | z1 %=% w2 | x1 %=% w2 | s1 %=% w2 | t1 %=% w2 | u1 %=% w2 | v1 %=% w2 | w1 %=% w2 | y1 %=% w2 | z1 %=% y2 | x1 %=% y2 | s1 %=% y2 | t1 %=% y2 | u1 %=% y2 | v1 %=% y2 | w1 %=% y2 | y1 %=% y2)"
Ugh, that looks terrible, but with 8 variables per group the performance still scales quite well:
b1e4 <- rbindlist(replicate(1e4, b1, simplify=FALSE))
system.time(b1e4[, result2 := eval(parse(text = bexpr))])
# user system elapsed
# 0.11 0.00 0.10
b1e5 <- rbindlist(replicate(1e5, b1, simplify=FALSE))
system.time(b1e5[, result2 := eval(parse(text = bexpr))])
# user system elapsed
# 1.03 0.00 1.03
b1e6 <- rbindlist(replicate(1e6, b1, simplify=FALSE))
system.time(b1e6[, result2 := eval(parse(text = bexpr))])
# user system elapsed
# 11.72 0.51 12.25
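For a sense of how the expression grows, the number of comparisons is the product of the two group sizes. A small sketch that counts the terms for the 8-by-8 case (the column names are written out by hand here rather than taken from b1):

```r
bg1 <- c("z1", "x1", "s1", "t1", "u1", "v1", "w1", "y1")
bg2 <- sub("1$", "2", bg1)       # "z2", "x2", "s2", ...

# one "%=%" term per (group 1, group 2) column pair
terms <- outer(bg1, bg2, function(a, b) sprintf("%s %%=%% %s", a, b))
n_terms <- length(terms)         # 8 * 8 = 64
```

So going from 2 to 8 variables per group means 64 comparisons instead of 4, which is worth keeping in mind if the real groups are much wider.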