Flag duplicate obs based on two ID variables

Question

Updated example (see RULE)

I have a data.table with columns id1 and id2 (as shown below)

data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))

id1	id2
1	1
1	2
2	2
3	1
3	2
3	3
4	2

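I want to generate a flag that identifies duplicate associations between id1 and id2.

RULE: if a particular id1 is already associated with an id2, then it should be flagged. A unique id2 should be associated with only one id1 (see the explanation below).

I am looking for a) an efficient solution and b) a solution that uses only base and data.table functions.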

id1	id2	flag
1	1
1	2	Y	<== since id2=1 is associated with id1=1 in 1st row
2	2
3	1	Y	<== since id2=1 is associated with id1=1 in 1st row
3	2	Y	<== since id2=2 is associated with id1=2 in 3rd row
3	3
4	2	Y	<== since id2=2 is associated with id1=2 in 3rd row

Answer 1

# replicate your data
library(data.table)
df <- data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))


# create and append a new, empty column that will later be filled with the info whether they match or not
df[, duplicate := NA_real_]

# create loop, to iteratively check if statement is true, then fill the new column accordingly
# note that only id1 is compared here; id2 is not checked


for (i in seq_len(nrow(df))) {
  # check if id1 in row i matches that of any row before it (rows 1 to i-1)
  if (df$id1[i] %in% df$id1[seq_len(i - 1)]) {
    set(df, i = i, j = "duplicate", value = 1)
  } else {
    set(df, i = i, j = "duplicate", value = 0) # when they don't match: 0, false
  }
}

# note that seq_len(i - 1) is empty for i=1, so the first row has nothing to match and gets 0 without an error.
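
The loop above asks, for each row, whether its id1 value already occurs in an earlier row. As a minimal sketch, assuming that is all the loop is meant to do, base R's duplicated() answers the same question in one vectorized call. The vector name dup is only illustrative.

```r
# id1 values from the question
id1 <- c(1, 1, 2, 3, 3, 3, 4)

# duplicated() is TRUE for every element that already appeared earlier in the vector
dup <- as.integer(duplicated(id1))
print(dup)
```

With the data.table from this answer, the same idea would read df[, duplicate := as.integer(duplicated(id1))]. Like the loop, it looks at id1 only, so it does not by itself apply the RULE from the question.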

Answer 2

This is a pretty convoluted chain, but I think it produces the result (the result in your question does not follow your own logic):

library(data.table)
a = data.table(id1=c(1,1,2,3,3,3,4), id2=c(1,2,2,1,2,3,2))

a[, .SD[1, ], by = id2][, 
                       Noflag := "no"][a, 
                                       on = .(id2, id1)][is.na(Noflag), 
                                                         flag := "y"][,
                                                                      Noflag := NULL][]

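What's inside:

a[, .SD[1, ], by = id2] gets the first row of each subgroup by id2. These rows should not be flagged, so
[, Noflag := "no"] marks them as "not flagged" (go figure, I said it was convoluted). We need to join this no-flag table with the original one:
[a, on = .(id2, id1)] joins the last table with the original a on id1 and id2. Now we need to flag the rows that were not marked as "should not be flagged":
[is.na(Noflag), flag := "y"]. The last part is to remove the unnecessary Noflag column:
[, Noflag := NULL] and add a [] to display the new table on screen.

I agree with @akrun's comment that igraph would be not only more efficient but also simpler in syntax.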
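
Since the chain flags every row that is not the first row of its id2 group, the same flags can probably be had more directly. The following is only a sketch, assuming no (id1, id2) pair occurs more than once in the data. The name b is illustrative, and fifelse() requires a reasonably recent data.table version.

```r
library(data.table)
a <- data.table(id1 = c(1, 1, 2, 3, 3, 3, 4), id2 = c(1, 2, 2, 1, 2, 3, 2))

# duplicated(id2) is TRUE for every row whose id2 value already appeared in an earlier row,
# that is, for every row that is not the first row of its id2 group
b <- copy(a)[, flag := fifelse(duplicated(id2), "y", NA_character_)]
b[]
```

Unlike the join, this keeps the original column order (id1, id2, flag) and works on a copy, so a itself is left unchanged.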

Answer 3

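This is a tricky one. If I understand correctly, my translation of the OP's rules is as follows:

For each id1 group, exactly one row is not flagged.
If an id1 group consists of only one row, it is not flagged.
Within an id1 group, all rows whose id2 has already been used in a previous group are flagged.
If more than one row in an id1 group is still unflagged at this point, only the first of them stays unflagged; all the others are flagged.

So, the approach is to

create a vector of available id2 values,
step through the id1 groups,
- in each group, find the first row whose id2 value has not been consumed in a previous group,
- flag all other rows,
- and update the vector of available (not consumed) id2 values.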

avail <- unique(DT$id2)
DT[, flag := {
  idx <- max(first(which(id2 %in% avail)), 1L)
  avail <<- setdiff(avail, id2)
  replace(rep("Y", .N), idx, "")
}, by = id1][]

   id1 id2 flag
1:   1   1     
2:   1   2    Y
3:   2   2     
4:   3   1    Y
5:   3   2    Y
6:   3   3     
7:   4   2

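Caveat

The code above reproduces the expected result for the use case given by the OP. However, there might be other use cases and/or edge cases where the code needs to be tweaked to meet the OP's expectations. For instance, it is unclear what the expected result is for an id1 group in which all id2 values have already been consumed in previous groups.

Edit:

The OP has edited the expected result so that row 7 is now flagged as well.

Here is a tweaked version of my code which reproduces the expected result after the edit: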

avail <- unique(DT$id2)
DT[, flag := {
  idx <- first(which(id2 %in% avail))
  avail <<- setdiff(avail, id2[idx])
  replace(rep("Y", .N), idx, "")
}, by = id1][]

   id1 id2 flag
1:   1   1     
2:   1   2    Y
3:   2   2     
4:   3   1    Y
5:   3   2    Y
6:   3   3     
7:   4   2    Y
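
The tweaked version keeps its state in a vector outside the grouped call. For readers who prefer to avoid that, here is a sketch of the same greedy procedure as a plain base R function with local state. flag_greedy is a hypothetical helper, not part of data.table or of the code above, and it assumes groups are processed in order of first appearance, as by = id1 does.

```r
# Hypothetical helper: greedy flagging with local state (base R only)
flag_greedy <- function(id1, id2) {
  avail <- unique(id2)            # id2 values not yet claimed by any id1 group
  flag <- rep("Y", length(id1))   # start with every row flagged
  for (g in unique(id1)) {        # groups in order of first appearance
    rows <- which(id1 == g)
    hit <- rows[id2[rows] %in% avail]
    if (length(hit) > 0L) {
      flag[hit[1L]] <- ""         # first row with an unclaimed id2 stays unflagged
      avail <- setdiff(avail, id2[hit[1L]])
    }
  }
  flag
}

res <- flag_greedy(c(1, 1, 2, 3, 3, 3, 4), c(1, 2, 2, 1, 2, 3, 2))
print(res)
```

With the DT from the Data section it could be applied as DT[, flag := flag_greedy(id1, id2)]. A group whose id2 values are all already claimed ends up fully flagged, which is how row 7 gets its Y.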

Data

library(data.table)
DT = data.table(id1 = c(1, 1, 2, 3, 3, 3, 4),
                id2 = c(1, 2, 2, 1, 2, 3, 2))
