How to delete rows that are not completely duplicated in R
Here is my example data:
resp=structure(list(person_number = c(914198L, 914198L, 914198L, 914198L,
914198L, 957505L, 957505L, 957505L, 957505L, 957505L, 967216L,
967216L, 967216L, 967216L, 967216L, 27771498L, 27771498L, 27771498L,
27771498L, 27771498L, 957505L, 957505L, 957505L, 914198L, 967216L,
967216L, 914198L, 967216L, 914198L), position_code = c(50000690L,
50000690L, 50000690L, 50000690L, 50000690L, 50000690L, 50000690L,
50000690L, 50000690L, 50000690L, 50000690L, 50000690L, 50000690L,
50000690L, 50000690L, 801L, 801L, 801L, 801L, 801L, 50000690L,
50000690L, 50000690L, 50000690L, 50000690L, 50000690L, 50000690L,
50000690L, 50000690L), date = c(7L, 2L, 1L, 4L, 5L, 6L, 3L, 4L,
5L, 2L, 3L, 5L, 1L, 6L, 7L, 7L, 2L, 6L, 4L, 1L, 6L, 3L, 4L, 1L,
3L, 5L, 4L, 7L, 5L), start_hour = c(9L, 9L, 11L, 9L, 9L, 9L,
9L, 11L, 9L, 9L, 9L, 11L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 12L,
15L, 10L, 9L, 11L, 10L, 11L, 10L, 9L), end_hour = c(21L, 21L,
21L, 15L, 15L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
19L, 19L, 19L, 19L, 19L, 21L, 21L, 19L, 21L, 21L, 21L, 21L, 21L,
21L)), class = "data.frame", row.names = c(NA, -29L))
Let me give a clear example so you can see what I need help with.
Take the subset of the data with person_number = 957505:
person_number position_code date start_hour end_hour
957505 50000690 6 9 21
957505 50000690 3 9 21
957505 50000690 4 11 21
957505 50000690 5 9 21
957505 50000690 2 9 21
957505 50000690 6 12 21
957505 50000690 3 15 21
957505 50000690 4 10 19
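Here we see that date = 6 occurs twice, with the ranges 9-21 and 12-21.
date = 4 also occurs twice, with the start-end hour ranges 11-21 and 10-19.
This means I need to randomly delete observations that share a date but have different ranges,
i.e. I need to delete any one of the observations for date = 6 and any one of the observations for date = 4.
Like this: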
person_number position_code date start_hour end_hour
957505 50000690 3 9 21
957505 50000690 5 9 21
957505 50000690 2 9 21
957505 50000690 6 12 21
957505 50000690 3 15 21
957505 50000690 4 10 19
However, there are also cases like this:
person_number position_code date start_hour end_hour
957505 50000690 6 9 21
957505 50000690 3 9 21
957505 50000690 4 11 21
957505 50000690 5 9 21
957505 50000690 2 9 21
957505 50000690 6 12 21
957505 50000690 3 15 21
957505 50000690 4 10 19
We see, for example, that date = 3 occurs with two start_hour-end_hour ranges here: 9-21 and 15-21.
For this person_number the 15-21 range does not occur anywhere else, but 9-21 occurs more than 2 times:
957505 50000690 6 9 21
957505 50000690 3 9 21
957505 50000690 5 9 21
957505 50000690 2 9 21
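It occurs 4 times here, so for date = 3 we delete the 9-21 row. The 15-21 range is not repeated 2 or more times, so it must be kept.
For any other case not covered by this condition, the earlier rule applies:
date = 6 occurs twice, with the ranges 9-21 and 12-21, and date = 4 also occurs twice, with the start-end hour ranges 11-21 and 10-19.
This means I need to randomly delete observations that share a date but have different ranges,
i.e. I need to delete any one of the observations for date = 6 and any one of the observations for date = 4.
How can I delete rows based on conditions like these?
Any help is appreciated. Thanks.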
Here is an idea of how to do this kind of filtering with the dplyr library:
library(dplyr)
# resp2: all rows whose (person_number, date) combination occurs more than once
multiple_date <- resp %>% count(person_number, date) %>% filter(n > 1)
resp2 <- semi_join(resp, multiple_date, by = c("person_number", "date"))
# show all of resp2
resp2
# show the rows of resp that are not in resp2
anti_join(resp, resp2, by = names(resp))
# compare resp with resp2 specifically for person 957505
resp %>% filter(person_number == 957505)
resp2 %>% filter(person_number == 957505)
# resp3: all rows whose hour range occurs more than once for the same person
multiple_hour <- resp %>% count(person_number, start_hour, end_hour) %>% filter(n > 1)
resp3 <- semi_join(resp, multiple_hour, by = c("person_number", "start_hour", "end_hour"))
# compare resp with resp3 specifically for person 957505
resp3 %>% filter(person_number == 957505)
resp %>% filter(person_number == 957505)
# resp4: all rows that have both a repeated date and a repeated hour range
resp4 <- semi_join(semi_join(resp, resp2, by = names(resp)), resp3, by = names(resp))
# compare resp with resp4 specifically for person 957505
resp4 %>% filter(person_number == 957505)
resp %>% filter(person_number == 957505)
# remove the rows that have both a repeated date and a repeated hour range
final <- anti_join(resp, resp4, by = names(resp))
# compare resp with final specifically for person 957505
final %>% filter(person_number == 957505)
resp %>% filter(person_number == 957505)
# check which (person_number, date) combinations still occur more than once
final %>% count(person_number, date) %>% filter(n > 1)
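One thing to watch with the approach above: when every row for a given person_number and date has an hour range that is repeated elsewhere, the final anti_join drops all of them, and dates where no range is repeated still keep both rows. Below is a minimal sketch of a variant that always keeps exactly one row per person_number and date, assuming that is the intended outcome. Rows with a unique hour range are preferred, and the remaining ties are broken at random. The small data frame resp_957505 and the helper column range_n are names made up for this illustration, and slice_sample() needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Illustrative subset: the eight rows for person_number 957505 from the question
resp_957505 <- data.frame(
  person_number = rep(957505L, 8),
  position_code = rep(50000690L, 8),
  date          = c(6L, 3L, 4L, 5L, 2L, 6L, 3L, 4L),
  start_hour    = c(9L, 9L, 11L, 9L, 9L, 12L, 15L, 10L),
  end_hour      = c(21L, 21L, 21L, 21L, 21L, 21L, 21L, 19L)
)

set.seed(1)  # only to make the random tie-break reproducible

deduped <- resp_957505 %>%
  # how often each hour range occurs for the same person (hypothetical helper column)
  add_count(person_number, start_hour, end_hour, name = "range_n") %>%
  group_by(person_number, date) %>%
  # within a date, prefer rows with a unique range; if there are none, keep all candidates
  filter(range_n == 1 | !any(range_n == 1)) %>%
  # pick one of the remaining candidates at random
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(-range_n)

deduped
```

For date = 3 and date = 6 this keeps the 15-21 and 12-21 rows, because 9-21 is the repeated range. For date = 4 neither range is repeated, so one of the two rows is chosen at random. To run it on the full data, replace resp_957505 with resp.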