在 R 中施加特殊条件时删除重复项

Remove duplicates while imposing special conditions in R

我有以下小标题(只显示大约 200 万行中的前 10 行):

   ID              STATUS     NUMBER  FUNCTION LASTMODIFIED         AMOUNT YEAR   MONTH  DAY  
   <chr>           <chr>      <chr>   <chr>    <dttm>               <dbl>  <int>  <int>  <int>
 1 oQYKPAsu9j8AAAF APPROVED   "008"   CREDIT   2022-03-15 15:16:26  2401   2022   3      15   
 2 hhoKPAs_fjUAAAF APPROVED   "101"   CREDIT   2022-03-15 15:15:23  959    2022   3      15   
 3 Ip8KPAsj__4AAAF DENIED     "99"    LIMIT    2022-03-15 15:14:06  0      2022   3      15   
 4 wa4KPAstYwIAAAF DENIED     "99"    LIMIT    2022-03-15 15:13:36  0      2022   3      15   
 5 GucKPAssdaUAAAF APPROVED   "101"   LIMIT    2022-03-15 15:13:21  1084   2022   3      15   
 6 a6AKPAtAsx4AAAF DENIED     "101"   CREDIT   2022-03-15 15:12:02  699    2022   3      15   
 7 a6AKPAtAsx4AAAF DENIED     "101"   CREDIT   2022-03-15 15:12:34  699    2022   3      15   
 8 F4kKPAss7OAAAAF APPROVED   "101"   CREDIT   2022-03-15 15:10:25  3167   2022   3      15   
 9 MK4KPAstiEYAAAF DENIED     "99"    LIMIT    2022-03-15 15:08:46  0      2022   3      15   
10 .nUKPAs.crIAAAF APPROVED    NA     CREDIT   2022-03-15 15:08:35  58     2022   3      15   

这里展示了不同用户在一个网站上的一些操作,每个ID代表一个唯一的客户。我想删除在 x 分钟内发生的重复条目。因此,显然只应保留上面数据中的第 6 行或第 7 行(最好是第一行)。有没有一种巧妙的 tidyverse/dplyr 方法可以做到这一点?

我的第一个想法是忽略 LASTMODIFIED 列并使用 dg &>& filter(!duplicate()) 但这不会做我想要的。

假设数据已经按 LASTMODIFIED 排序(至少在每个组内),那么

xseconds <- 600
dat %>%
  group_by(across(-LASTMODIFIED)) %>%
  filter(c(TRUE, as.numeric(diff(LASTMODIFIED), units="secs") > xseconds)) %>%
  ungroup()
# # A tibble: 9 x 9
#   ID              STATUS   NUMBER FUNCTION LASTMODIFIED        AMOUNT  YEAR MONTH   DAY
#   <chr>           <chr>     <int> <chr>    <dttm>               <int> <int> <int> <int>
# 1 oQYKPAsu9j8AAAF APPROVED      8 CREDIT   2022-03-15 15:16:26   2401  2022     3    15
# 2 hhoKPAs_fjUAAAF APPROVED    101 CREDIT   2022-03-15 15:15:23    959  2022     3    15
# 3 Ip8KPAsj__4AAAF DENIED       99 LIMIT    2022-03-15 15:14:06      0  2022     3    15
# 4 wa4KPAstYwIAAAF DENIED       99 LIMIT    2022-03-15 15:13:36      0  2022     3    15
# 5 GucKPAssdaUAAAF APPROVED    101 LIMIT    2022-03-15 15:13:21   1084  2022     3    15
# 6 a6AKPAtAsx4AAAF DENIED      101 CREDIT   2022-03-15 15:12:02    699  2022     3    15
# 7 F4kKPAss7OAAAAF APPROVED    101 CREDIT   2022-03-15 15:10:25   3167  2022     3    15
# 8 MK4KPAstiEYAAAF DENIED       99 LIMIT    2022-03-15 15:08:46      0  2022     3    15
# 9 .nUKPAs.crIAAAF APPROVED     NA CREDIT   2022-03-15 15:08:35     58  2022     3    15

数据

dat <- structure(list(ID = c("oQYKPAsu9j8AAAF", "hhoKPAs_fjUAAAF", "Ip8KPAsj__4AAAF", "wa4KPAstYwIAAAF", "GucKPAssdaUAAAF", "a6AKPAtAsx4AAAF", "a6AKPAtAsx4AAAF", "F4kKPAss7OAAAAF", "MK4KPAstiEYAAAF", ".nUKPAs.crIAAAF"), STATUS = c("APPROVED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "DENIED", "APPROVED", "DENIED", "APPROVED"), NUMBER = c(8L, 101L, 99L, 99L, 101L, 101L, 101L, 101L, 99L, NA), FUNCTION = c("CREDIT", "CREDIT", "LIMIT", "LIMIT", "LIMIT", "CREDIT", "CREDIT", "CREDIT", "LIMIT", "CREDIT" ), LASTMODIFIED = structure(c(1647371786, 1647371723, 1647371646, 1647371616, 1647371601, 1647371522, 1647371554, 1647371425, 1647371326, 1647371315), class = c("POSIXct", "POSIXt"), tzone = ""), AMOUNT = c(2401L, 959L, 0L, 0L, 1084L, 699L, 699L, 3167L, 0L, 58L), YEAR = c(2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L, 2022L), MONTH = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), DAY = c(15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c("1", "2",  "3", "4", "5", "6", "7", "8", "9", "10"))