Remove duplicates and manipulate dataframe based on conditions from 2 columns

My data frame looks like this:

+------+-----+----------+--------+
| from | to  | distance | weight |
+------+-----+----------+--------+
|    1 |   8 |        1 |     10 |
|    2 |   6 |        1 |      9 |
|    3 |   4 |        1 |      5 |
|    4 |   5 |        3 |      9 |
|    5 |   6 |        4 |      8 |
|    6 |   2 |        5 |      2 |
|    7 |   8 |        2 |      1 |
|    4 |   3 |        5 |      6 |
|    2 |   1 |        1 |      7 |
|    6 |   8 |        4 |      8 |
|    1 |   7 |        5 |      3 |
|    8 |   4 |        6 |      7 |
|    9 |   5 |        3 |      9 |
|   10 |   3 |        8 |      2 |
+------+-----+----------+--------+
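
For reproducibility, the table above could be built as follows. This is only a sketch: the object name DT is an assumption, chosen to match the name used by the code in the answer below, and the columns are stored as integers so that later assignments of NA_integer_ keep their type.

```r
# Sample data from the table above, stored as integer columns.
# The name `DT` is an assumption that matches the answer's code.
DT <- data.frame(
  from     = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L),
  to       = c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L),
  distance = c(1L, 1L, 1L, 3L, 4L, 5L, 2L, 5L, 1L, 4L, 5L, 6L, 3L, 8L),
  weight   = c(10L, 9L, 5L, 9L, 8L, 2L, 1L, 6L, 7L, 8L, 3L, 7L, 9L, 2L)
)
str(DT)
```

A plain data.frame is enough here, because setDT() converts it to a data.table by reference.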

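I want to filter the data by applying the following conditions in order:

  1. If a number appears in the to column, it should not be repeated in either the to or the from column
  2. A number in from can be repeated if its corresponding to is a new value that does not yet appear in any cell of the to column
  3. I want to repeat this process until all the unique values from from and to combined appear at least once in either of the columns
  4. If a number in the from column is a new number and its corresponding to value already exists in either column, then replace the to and distance values with a blank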

So the resulting table looks like this:

+------+-----+----------+--------+
| from | to  | distance | weight |
+------+-----+----------+--------+
|    1 |   8 |        1 |     10 |
|    2 |   6 |        1 |      9 |
|    3 |   4 |        1 |      5 |
|    1 |   7 |        5 |      3 |
|    9 |   5 |        3 |      9 |
|   10 |     |          |      2 |
+------+-----+----------+--------+

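This is an attempt to reproduce the expected result by following the OP's rules.

I am still struggling to find a solution that uses unique() or duplicated(), either on the data in wide format or after reshaping to long format.

However, here is a solution using a for loop which reproduces the expected result for the given sample dataset: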

library(data.table)
# append row numbers
setDT(DT)[, rn := .I]

# which values appear only once in the `to` column?
single_to <- DT[, .N, by = to][N == 1L, to]
single_to
[1] 2 1 7
DT[, drop := NA]
for (i in seq_len(nrow(DT))) {
  print(i)
  print(DT[i])
  if (isTRUE(DT$drop[i])) next # row already has been eliminated
  act_to <- DT$to[i]
  # Rule 1: eliminate subsequent rows with repeated value in `to` column  
  DT[rn > i & (to == act_to), drop := TRUE]
  # Rule 1: eliminate subsequent rows with repeated value in `from` column 
  # Rule 2: but keep rows where value is unique in the `to` column  
  DT[rn > i & (from == act_to) & !(to %in% single_to), drop := TRUE]
  DT[i, drop := FALSE]
  print(DT[])
}
[1] 1
   from to distance weight rn drop
1:    1  8        1     10  1   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2    NA
 3:    3  4        1      5  3    NA
 4:    4  5        3      9  4    NA
 5:    5  6        4      8  5    NA
 6:    6  2        5      2  6    NA
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8    NA
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 2
   from to distance weight rn drop
1:    2  6        1      9  2   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3    NA
 4:    4  5        3      9  4    NA
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6    NA
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8    NA
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 3
   from to distance weight rn drop
1:    3  4        1      5  3   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6    NA
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 4
   from to distance weight rn drop
1:    4  5        3      9  4 TRUE
[1] 5
   from to distance weight rn drop
1:    5  6        4      8  5 TRUE
[1] 6
   from to distance weight rn drop
1:    6  2        5      2  6   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9    NA
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 7
   from to distance weight rn drop
1:    7  8        2      1  7 TRUE
[1] 8
   from to distance weight rn drop
1:    4  3        5      6  8 TRUE
[1] 9
   from to distance weight rn drop
1:    2  1        1      7  9   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11    NA
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 10
   from to distance weight rn drop
1:    6  8        4      8 10 TRUE
[1] 11
   from to distance weight rn drop
1:    1  7        5      3 11   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11 FALSE
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13    NA
14:   10  3        8      2 14    NA
[1] 12
   from to distance weight rn drop
1:    8  4        6      7 12 TRUE
[1] 13
   from to distance weight rn drop
1:    9  5        3      9 13   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11 FALSE
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13 FALSE
14:   10  3        8      2 14    NA
[1] 14
   from to distance weight rn drop
1:   10  3        8      2 14   NA
    from to distance weight rn  drop
 1:    1  8        1     10  1 FALSE
 2:    2  6        1      9  2 FALSE
 3:    3  4        1      5  3 FALSE
 4:    4  5        3      9  4  TRUE
 5:    5  6        4      8  5  TRUE
 6:    6  2        5      2  6 FALSE
 7:    7  8        2      1  7  TRUE
 8:    4  3        5      6  8  TRUE
 9:    2  1        1      7  9 FALSE
10:    6  8        4      8 10  TRUE
11:    1  7        5      3 11 FALSE
12:    8  4        6      7 12  TRUE
13:    9  5        3      9 13 FALSE
14:   10  3        8      2 14 FALSE

The result so far differs from the expected result:

result <- DT[!(drop)]
result
   from to distance weight rn  drop
1:    1  8        1     10  1 FALSE
2:    2  6        1      9  2 FALSE
3:    3  4        1      5  3 FALSE
4:    6  2        5      2  6 FALSE
5:    2  1        1      7  9 FALSE
6:    1  7        5      3 11 FALSE
7:    9  5        3      9 13 FALSE
8:   10  3        8      2 14 FALSE

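Rows 1 to 3, 11, 13, and 14 are in line with the expected result, but rows 6 and 9 are kept as well because the values 2 and 1 each appear only once in the to column.

Apparently, the approach needs to be refined, because 2 and 1 have already appeared in the from column, in rows 2 and 1, respectively. Rows 6 and 9 therefore need to be removed as duplicates.

To remove them, result is reshaped from wide to long format and ordered by row number: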

ldt <- melt(result, "rn", c("to", "from"))[order(rn)]
ldt
    rn variable value
 1:  1       to     8
 2:  1     from     1
 3:  2       to     6
 4:  2     from     2
 5:  3       to     4
 6:  3     from     3
 7:  6       to     2
 8:  6     from     6
 9:  9       to     1
10:  9     from     2
11: 11       to     7
12: 11     from     1
13: 13       to     5
14: 13     from     9
15: 14       to     3
16: 14     from    10

Now, we need to identify the row numbers of the duplicates that belong to the single_to values:

ldt[duplicated(value) & variable == "to" & value %in% single_to]
   rn variable value
1:  6       to     2
2:  9       to     1

These rows are removed from result by an anti-join:

result2 <-
  result[!ldt[duplicated(value) & variable == "to" & value %in% single_to], on = .(rn)]
result2
   from to distance weight rn  drop
1:    1  8        1     10  1 FALSE
2:    2  6        1      9  2 FALSE
3:    3  4        1      5  3 FALSE
4:    1  7        5      3 11 FALSE
5:    9  5        3      9 13 FALSE
6:   10  3        8      2 14 FALSE

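This now almost matches the expected result. Only rule 4 remains to be implemented. For that, the same approach as before is used: reshape to long format, identify the row numbers, and join. This time, however, an update join is used: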

ldt2 <- melt(unique(result2, by = "from"), "rn", c("to", "from"))[order(rn)]
result2[ldt2[duplicated(value)], on = .(rn), c("to", "distance") := NA_integer_]
result2
   from to distance weight rn  drop
1:    1  8        1     10  1 FALSE
2:    2  6        1      9  2 FALSE
3:    3  4        1      5  3 FALSE
4:    1  7        5      3 11 FALSE
5:    9  5        3      9 13 FALSE
6:   10 NA       NA      2 14 FALSE
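
The two join idioms used above can be illustrated in isolation. The following is a minimal sketch on toy data; the objects X, Y, anti and X2 are made up for illustration and are not part of the solution. An anti-join keeps the rows of X that have no match in Y, while an update join modifies the matching rows of X in place.

```r
library(data.table)

# Toy data (hypothetical, for illustration only)
X <- data.table(rn = 1:4, to = c(8L, 6L, 2L, 1L))
Y <- data.table(rn = c(3L, 4L))

# Anti-join: keep the rows of X whose rn does not occur in Y
anti <- X[!Y, on = .(rn)]
print(anti)

# Update join: set `to` to NA for the rows of X2 whose rn occurs in Y.
# copy() is used so that X itself is not modified by reference.
X2 <- copy(X)
X2[Y, on = .(rn), to := NA_integer_]
print(X2)
```

Because := updates by reference, the update join returns its result invisibly, which is why result2 is printed in a separate step above.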

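Discussion

This solution makes no claim to be efficient in terms of coding or execution speed. It merely aims to reproduce the expected result for the given sample dataset.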

It needs more testing. For example, in rule 3 the OP requests:

I want to repeat this process until all the unique values from the from and to combined appear atleast once in either of the columns

With rules 1 and 2 implemented as above, there is no final check of whether this condition is met.
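
One way such a final check could be sketched is shown below. The helper covers_all_values() is hypothetical and not part of the solution above. It reflects one possible reading of rule 3: every value that occurs in from or to of the original data must still occur in from or to of the filtered result, ignoring blanked (NA) cells.

```r
# Hypothetical helper: TRUE if every value found in `from` or `to` of the
# original data also appears in `from` or `to` of the filtered result.
covers_all_values <- function(original, result) {
  all_values  <- unique(c(original$from, original$to))
  kept_values <- unique(c(result$from, result$to))
  kept_values <- kept_values[!is.na(kept_values)]  # ignore blanked cells
  length(setdiff(all_values, kept_values)) == 0L
}

# `from` and `to` of the sample data and of the final result shown above
original <- data.frame(
  from = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L),
  to   = c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L)
)
result <- data.frame(
  from = c(1L, 2L, 3L, 1L, 9L, 10L),
  to   = c(8L, 6L, 4L, 7L, 5L, NA)
)

covers_all_values(original, result)        # TRUE: values 1 to 10 all appear
covers_all_values(original, result[-6, ])  # FALSE: value 10 would be missing
```

For the sample data the check passes, but this says nothing about other inputs, which is why more testing is needed.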

Besides, I believe there are probably other ways to achieve the same goal.