Remove duplicates and manipulate dataframe based on conditions from 2 columns
My data frame looks like this:
+------+-----+----------+--------+
| from | to | distance | weight |
+------+-----+----------+--------+
| 1 | 8 | 1 | 10 |
| 2 | 6 | 1 | 9 |
| 3 | 4 | 1 | 5 |
| 4 | 5 | 3 | 9 |
| 5 | 6 | 4 | 8 |
| 6 | 2 | 5 | 2 |
| 7 | 8 | 2 | 1 |
| 4 | 3 | 5 | 6 |
| 2 | 1 | 1 | 7 |
| 6 | 8 | 4 | 8 |
| 1 | 7 | 5 | 3 |
| 8 | 4 | 6 | 7 |
| 9 | 5 | 3 | 9 |
| 10 | 3 | 8 | 2 |
+------+-----+----------+--------+
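For anyone who wants to run code against this table, it could be entered as a data.table along these lines. This is only a sketch: the object name DT is borrowed from the answer's code, and storing every column as integer is an assumption that the question does not state.

```r
library(data.table)

# sample data copied from the table above; integer storage is an assumption
DT <- data.table(
  from     = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 2L, 6L, 1L, 8L, 9L, 10L),
  to       = c(8L, 6L, 4L, 5L, 6L, 2L, 8L, 3L, 1L, 8L, 7L, 4L, 5L, 3L),
  distance = c(1L, 1L, 1L, 3L, 4L, 5L, 2L, 5L, 1L, 4L, 5L, 6L, 3L, 8L),
  weight   = c(10L, 9L, 5L, 9L, 8L, 2L, 1L, 6L, 7L, 8L, 3L, 7L, 9L, 2L)
)
DT
```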
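I want to filter the data sequentially based on the following conditions:
- If a number appears in the to column, it should not be repeated in either the to or the from column.
- A number in from can be repeated if its corresponding to is a new value that does not appear in any other cell of the to column.
- I want to repeat this process until all the unique values from the from and to combined appear at least once in either of the columns.
- If the number in the from column is a new number and its corresponding to value is already present in either column, replace the to and distance values with blanks.
So the resulting table looks like this: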
+------+-----+----------+--------+
| from | to | Distance | weight |
+------+-----+----------+--------+
| 1 | 8 | 1 | 10 |
| 2 | 6 | 1 | 9 |
| 3 | 4 | 1 | 5 |
| 1 | 7 | 5 | 3 |
| 9 | 5 | 3 | 9 |
| 10 | | | 2 |
+------+-----+----------+--------+
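This is an attempt to reproduce the expected result according to the OP's rules.
I am still struggling to find a solution that uses unique() and duplicated() on the wide-format data together with reshaping to long format.
However, here is a solution using a for loop that reproduces the expected result for the given sample dataset: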
library(data.table)
# append row numbers
setDT(DT)[, rn := .I]
# which values appear only once in the `to` column?
single_to <- DT[, .N, by = to][N == 1L, to]
single_to
[1] 2 1 7
DT[, drop := NA]
for (i in seq_len(nrow(DT))) {
  print(i)
  print(DT[i])
  if (isTRUE(DT$drop[i])) next # row already has been eliminated
  act_to <- DT$to[i]
  # Rule 1: eliminate subsequent rows with repeated value in `to` column
  DT[rn > i & (to == act_to), drop := TRUE]
  # Rule 1: eliminate subsequent rows with repeated value in `from` column
  # Rule 2: but keep rows where value is unique in the `to` column
  DT[rn > i & (from == act_to) & !(to %in% single_to), drop := TRUE]
  DT[i, drop := FALSE]
  print(DT[])
}
[1] 1
from to distance weight rn drop
1: 1 8 1 10 1 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 NA
3: 3 4 1 5 3 NA
4: 4 5 3 9 4 NA
5: 5 6 4 8 5 NA
6: 6 2 5 2 6 NA
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 NA
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 2
from to distance weight rn drop
1: 2 6 1 9 2 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 NA
4: 4 5 3 9 4 NA
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 NA
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 NA
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 3
from to distance weight rn drop
1: 3 4 1 5 3 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 NA
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 4
from to distance weight rn drop
1: 4 5 3 9 4 TRUE
[1] 5
from to distance weight rn drop
1: 5 6 4 8 5 TRUE
[1] 6
from to distance weight rn drop
1: 6 2 5 2 6 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 NA
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 7
from to distance weight rn drop
1: 7 8 2 1 7 TRUE
[1] 8
from to distance weight rn drop
1: 4 3 5 6 8 TRUE
[1] 9
from to distance weight rn drop
1: 2 1 1 7 9 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 NA
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 10
from to distance weight rn drop
1: 6 8 4 8 10 TRUE
[1] 11
from to distance weight rn drop
1: 1 7 5 3 11 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 FALSE
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 NA
14: 10 3 8 2 14 NA
[1] 12
from to distance weight rn drop
1: 8 4 6 7 12 TRUE
[1] 13
from to distance weight rn drop
1: 9 5 3 9 13 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 FALSE
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 FALSE
14: 10 3 8 2 14 NA
[1] 14
from to distance weight rn drop
1: 10 3 8 2 14 NA
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 4 5 3 9 4 TRUE
5: 5 6 4 8 5 TRUE
6: 6 2 5 2 6 FALSE
7: 7 8 2 1 7 TRUE
8: 4 3 5 6 8 TRUE
9: 2 1 1 7 9 FALSE
10: 6 8 4 8 10 TRUE
11: 1 7 5 3 11 FALSE
12: 8 4 6 7 12 TRUE
13: 9 5 3 9 13 FALSE
14: 10 3 8 2 14 FALSE
The result so far differs from the expected result:
result <- DT[!(drop)]
result
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 6 2 5 2 6 FALSE
5: 2 1 1 7 9 FALSE
6: 1 7 5 3 11 FALSE
7: 9 5 3 9 13 FALSE
8: 10 3 8 2 14 FALSE
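Rows 1 to 3, 11, 13, and 14 are in line with the expected result, but rows 6 and 9 are kept because the values 2 and 1 each appear only once in the to column.
Apparently, this approach needs to be refined, because 1 and 2 have already appeared in the from column in rows 1 and 2, respectively. Rows 6 and 9 need to be removed as duplicates.
To remove them, result is reshaped from wide to long format and ordered by row number: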
ldt <- melt(result, "rn", c("to", "from"))[order(rn)]
ldt
rn variable value
1: 1 to 8
2: 1 from 1
3: 2 to 6
4: 2 from 2
5: 3 to 4
6: 3 from 3
7: 6 to 2
8: 6 from 6
9: 9 to 1
10: 9 from 2
11: 11 to 7
12: 11 from 1
13: 13 to 5
14: 13 from 9
15: 14 to 3
16: 14 from 10
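The ordering by rn matters because duplicated() flags only the second and later occurrences of a value, in vector order, so the earliest row keeps its value unflagged. A minimal sketch on a made-up vector (the values are illustrative and not taken from ldt):

```r
# made-up values in row order; not the data from the question
value <- c(8, 1, 6, 2, 2, 1)

# first occurrences are FALSE, repeats are TRUE
dup <- duplicated(value)
dup

# positions of the repeats
which(dup)
```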
Now we need to identify the row numbers of the duplicates that belong to single_to values:
ldt[duplicated(value) & variable == "to" & value %in% single_to]
rn variable value
1: 6 to 2
2: 9 to 1
These rows are removed from result by an anti-join:
result2 <-
result[!ldt[duplicated(value) & variable == "to" & value %in% single_to], on = .(rn)]
result2
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 1 7 5 3 11 FALSE
5: 9 5 3 9 13 FALSE
6: 10 3 8 2 14 FALSE
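For readers unfamiliar with the X[!Y, on = ...] idiom used above, here is a small sketch of an anti-join in data.table. The tables X and Y and their contents are hypothetical and serve only to show the mechanics:

```r
library(data.table)

# hypothetical toy tables, not the data from the question
X <- data.table(rn = 1:5, from = c(1L, 2L, 3L, 6L, 2L))
Y <- data.table(rn = c(4L, 5L))

# anti-join: keep only the rows of X whose rn has no match in Y
kept <- X[!Y, on = .(rn)]
kept
```

Only the rows with rn 1, 2, and 3 remain, because rn 4 and 5 have a match in Y.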
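Now this is almost in line with the expected result. Only rule 4 remains to be applied. For this, the same approach as before is used: reshape to long format, identify the row numbers, and join. However, an update join is used here: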
ldt2 <- melt(unique(result2, by = "from"), "rn", c("to", "from"))[order(rn)]
result2[ldt2[duplicated(value)], on = .(rn), c("to", "distance") := NA_integer_]
result2
from to distance weight rn drop
1: 1 8 1 10 1 FALSE
2: 2 6 1 9 2 FALSE
3: 3 4 1 5 3 FALSE
4: 1 7 5 3 11 FALSE
5: 9 5 3 9 13 FALSE
6: 10 NA NA 2 14 FALSE
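The update join above modifies result2 by reference instead of returning a new table. A stripped-down sketch of the same idiom, using hypothetical toy tables X and Y rather than the data from the question:

```r
library(data.table)

# hypothetical toy tables, not the data from the question
X <- data.table(rn = 1:3, to = c(8L, 6L, 3L), distance = c(1L, 1L, 8L))
Y <- data.table(rn = 3L)

# update join: for rows of X matched by Y on rn, overwrite both columns in place
X[Y, on = .(rn), c("to", "distance") := NA_integer_]
X
```

Rows without a match in Y are left untouched, which is why only the last row loses its to and distance values.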
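Discussion
This solution does not claim to be efficient in terms of coding or execution speed. It just aims at reproducing the expected result from the given sample dataset.
It needs more testing. For example, the OP has requested in rule 3:
I want to repeat this process until all the unique values from the
from and to combined appear atleast once in either of the columns
Implementing rules 1 and 2 does not include a final check of whether this condition is met.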
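Such a final check could be sketched as follows. This is not part of the solution above: the vectors are copied by hand from the sample data and from the final result2, and the variable names are made up for illustration.

```r
# values copied from the sample data and from the final result above
input_from  <- c(1, 2, 3, 4, 5, 6, 7, 4, 2, 6, 1, 8, 9, 10)
input_to    <- c(8, 6, 4, 5, 6, 2, 8, 3, 1, 8, 7, 4, 5, 3)
result_from <- c(1, 2, 3, 1, 9, 10)
result_to   <- c(8, 6, 4, 7, 5, NA)

# every distinct value of the input ...
all_values <- union(input_from, input_to)

# ... should show up in at least one column of the result
missing_values <- setdiff(all_values, c(result_from, result_to))
rule3_ok <- length(missing_values) == 0
rule3_ok
```

For the sample data the check passes, since each of the values 1 to 10 still occurs in from or to of the result.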
In addition, I believe there may be other approaches to achieve the same goal.