Remove duplicates while imposing special conditions in R
I have the following tibble (only the first 10 of roughly 2 million rows are shown):
ID STATUS NUMBER FUNCTION LASTMODIFIED AMOUNT YEAR MONTH DAY
<chr> <chr> <chr> <chr> <dttm> <dbl> <int> <int> <int>
1 oQYKPAsu9j8AAAF APPROVED "008" CREDIT 2022-03-15 15:16:26 2401 2022 3 15
2 hhoKPAs_fjUAAAF APPROVED "101" CREDIT 2022-03-15 15:15:23 959 2022 3 15
3 Ip8KPAsj__4AAAF DENIED "99" LIMIT 2022-03-15 15:14:06 0 2022 3 15
4 wa4KPAstYwIAAAF DENIED "99" LIMIT 2022-03-15 15:13:36 0 2022 3 15
5 GucKPAssdaUAAAF APPROVED "101" LIMIT 2022-03-15 15:13:21 1084 2022 3 15
6 a6AKPAtAsx4AAAF DENIED "101" CREDIT 2022-03-15 15:12:02 699 2022 3 15
7 a6AKPAtAsx4AAAF DENIED "101" CREDIT 2022-03-15 15:12:34 699 2022 3 15
8 F4kKPAss7OAAAAF APPROVED "101" CREDIT 2022-03-15 15:10:25 3167 2022 3 15
9 MK4KPAstiEYAAAF DENIED "99" LIMIT 2022-03-15 15:08:46 0 2022 3 15
10 .nUKPAs.crIAAAF APPROVED NA CREDIT 2022-03-15 15:08:35 58 2022 3 15
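This shows some actions of different users on a website; each ID represents a unique customer. I want to remove duplicate entries that occur within x minutes of each other. So, obviously, only one of rows 6 and 7 in the data above should be kept (preferably the first one). Is there a neat tidyverse/dplyr way to do this?
My first idea was to ignore the LASTMODIFIED column and use dg %>% filter(!duplicated(.)), but that does not do what I want.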
Assuming the data is already sorted by LASTMODIFIED in ascending order (at least within each group), then:
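library(dplyr)

xseconds <- 600  # 10-minute window, in seconds
dat %>%
  # group rows that are identical in every column except the timestamp
  group_by(across(-LASTMODIFIED)) %>%
  # keep the first row of each group, then any row more than xseconds after the previous one
  filter(c(TRUE, as.numeric(diff(LASTMODIFIED), units = "secs") > xseconds)) %>%
  ungroup()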
# # A tibble: 9 x 9
# ID STATUS NUMBER FUNCTION LASTMODIFIED AMOUNT YEAR MONTH DAY
# <chr> <chr> <int> <chr> <dttm> <int> <int> <int> <int>
# 1 oQYKPAsu9j8AAAF APPROVED 8 CREDIT 2022-03-15 15:16:26 2401 2022 3 15
# 2 hhoKPAs_fjUAAAF APPROVED 101 CREDIT 2022-03-15 15:15:23 959 2022 3 15
# 3 Ip8KPAsj__4AAAF DENIED 99 LIMIT 2022-03-15 15:14:06 0 2022 3 15
# 4 wa4KPAstYwIAAAF DENIED 99 LIMIT 2022-03-15 15:13:36 0 2022 3 15
# 5 GucKPAssdaUAAAF APPROVED 101 LIMIT 2022-03-15 15:13:21 1084 2022 3 15
# 6 a6AKPAtAsx4AAAF DENIED 101 CREDIT 2022-03-15 15:12:02 699 2022 3 15
# 7 F4kKPAss7OAAAAF APPROVED 101 CREDIT 2022-03-15 15:10:25 3167 2022 3 15
# 8 MK4KPAstiEYAAAF DENIED 99 LIMIT 2022-03-15 15:08:46 0 2022 3 15
# 9 .nUKPAs.crIAAAF APPROVED NA CREDIT 2022-03-15 15:08:35 58 2022 3 15
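If the ordering assumption is not guaranteed, one option is to sort before grouping. The following is a minimal sketch rather than a tested solution for the full 2-million-row data; the `toy` data frame and its values are made up for illustration and are deliberately out of order.

```r
library(dplyr)

# Hypothetical toy data: ID "a" has three otherwise identical events, out of order.
toy <- data.frame(
  ID = c("a", "a", "a", "b"),
  STATUS = c("DENIED", "DENIED", "DENIED", "APPROVED"),
  LASTMODIFIED = as.POSIXct(
    c("2022-03-15 15:12:34", "2022-03-15 15:12:02",
      "2022-03-15 15:30:00", "2022-03-15 15:10:25"),
    tz = "UTC"
  )
)

xseconds <- 600

deduped <- toy %>%
  arrange(LASTMODIFIED) %>%            # enforce ascending order first
  group_by(across(-LASTMODIFIED)) %>%  # rows identical apart from the timestamp
  filter(c(TRUE, as.numeric(diff(LASTMODIFIED), units = "secs") > xseconds)) %>%
  ungroup()

deduped
```

Here the 15:12:34 event is dropped because it follows the 15:12:02 event by 32 seconds, while the 15:30:00 event is kept because it is more than 600 seconds after the previous row in its group.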
Data
dat <- structure(list(ID = c("oQYKPAsu9j8AAAF", "hhoKPAs_fjUAAAF", "Ip8KPAsj__4AAAF", "wa4KPAstYwIAAAF", "GucKPAssdaUAAAF", "a6AKPAtAsx4AAAF", "a6AKPAtAsx4AAAF", "F4kKPAss7OAAAAF", "MK4KPAstiEYAAAF", ".nUKPAs.crIAAAF"), STATUS = c("APPROVED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "APPROVED"), NUMBER = c(8L, 101L, 99L, 99L, 101L, 101L, 101L, 101L, 99L, NA), FUNCTION = c("CREDIT", "CREDIT", "LIMIT", "LIMIT", "LIMIT", "CREDIT", "CREDIT", "CREDIT", "LIMIT", "CREDIT" ), LASTMODIFIED = structure(c(1647371786, 1647371723, 1647371646, 1647371616, 1647371601, 1647371522, 1647371554, 1647371425, 1647371326, 1647371315), class = c("POSIXct", "POSIXt"), tzone = ""), AMOUNT = c(2401L, 959L, 0L, 0L, 1084L, 699L, 699L, 3167L, 0L, 58L), YEAR = c(2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L), MONTH = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), DAY = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
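One caveat about the diff()-based filter: it compares each row with the row immediately before it, not with the last row that was kept. In a chain of repeats that are each less than x seconds apart, every row after the first is dropped, even if the chain as a whole spans more than x seconds. If the intended rule is instead "drop anything within x seconds of the last kept row", a rough sketch could look like the following. The helper `keep_after_gap()` and the `toy` data are hypothetical names and values introduced only for illustration, and the sketch assumes timestamps are sorted ascending within each group.

```r
library(dplyr)

# Hypothetical helper: TRUE for rows more than `gap` seconds after the most
# recently kept row. Assumes `times` is sorted in ascending order.
keep_after_gap <- function(times, gap) {
  secs <- as.numeric(times)  # seconds since the epoch
  keep <- logical(length(secs))
  last_kept <- -Inf          # ensures the first row is always kept
  for (i in seq_along(secs)) {
    if (secs[i] - last_kept > gap) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

# Hypothetical toy data: four identical events, each 400 seconds apart.
toy <- data.frame(
  ID = "a",
  LASTMODIFIED = as.POSIXct("2022-03-15 15:00:00", tz = "UTC") + c(0, 400, 800, 1200)
)

xseconds <- 600

chained <- toy %>%
  group_by(across(-LASTMODIFIED)) %>%
  filter(keep_after_gap(LASTMODIFIED, xseconds)) %>%
  ungroup()

chained
```

With these made-up timestamps the diff()-based filter would keep only the first row, whereas this variant keeps the rows at 0 and 800 seconds. Which behavior is wanted depends on how "duplicate within x minutes" is meant to be read.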