Eliminate duplicates based on 2 dates in a dataframe

I have this example data frame:

df <- data.frame(
  ID    = c(rep("5", 11), rep("14", 6)),
  Date1 = c(rep("22/07/2014", 11), rep("08/11/2016", 6)),
  Date2 = c("01/01/2011", "01/08/2011", "01/12/2010", "10/11/2015",
            "22/07/2014", "01/01/2013", "23/04/2014", "01/01/2006",
            "01/01/2013", "01/10/2012", "01/08/2012", "14/04/2015",
            "01/10/2008", "01/10/2008", "14/05/2015", "11/04/2015",
            "05/10/2008"),
  stringsAsFactors = FALSE
)

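Each ID is repeated several times. I need to get a data frame with only 1 row per ID. As you can see, each patient has only one date in the df$Date1 column, so the condition for selecting 1 row per patient is: keep the row whose Date2 is closest to Date1.

How can I do this?

Thanks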

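Here is a tidyverse approach. I create a column named diff_date, which is the absolute difference between Date1 and Date2. Then I filter for the minimum difference within each ID.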

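library(dplyr)
library(lubridate)

df %>%
  mutate(
    # parse both date columns from "dd/mm/yyyy" strings
    across(.cols = starts_with("Date"), .fns = dmy),
    # difftime() with the default units = "auto" returns seconds here,
    # because the smallest difference in the column is 0
    diff_date = abs(as.numeric(difftime(Date1, Date2)))
  ) %>%
  group_by(ID) %>%
  # keeps every row that ties for the minimum within an ID
  filter(diff_date == min(diff_date))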

# A tibble: 2 x 4
# Groups:   ID [2]
  ID    Date1      Date2      diff_date
  <chr> <date>     <date>         <dbl>
1 5     2014-07-22 2014-07-22         0
2 14    2016-11-08 2015-05-14  47001600
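The diff_date of 47001600 above is in seconds (544 days), and filter() would keep more than one row per ID if two rows tied for the minimum. As a hedged variant of the same idea, not the answer's original code, the sketch below pins the unit to days and uses slice_min() with with_ties = FALSE so that at most one row per ID survives. It assumes dplyr 1.0.0 or later; the column name diff_days and the object name res are made up for illustration, and the data frame is rebuilt so the snippet runs on its own.

```r
library(dplyr)

# Compact copy of the question's data so the snippet is self-contained
df <- data.frame(
  ID    = c(rep("5", 11), rep("14", 6)),
  Date1 = c(rep("22/07/2014", 11), rep("08/11/2016", 6)),
  Date2 = c("01/01/2011", "01/08/2011", "01/12/2010", "10/11/2015",
            "22/07/2014", "01/01/2013", "23/04/2014", "01/01/2006",
            "01/01/2013", "01/10/2012", "01/08/2012", "14/04/2015",
            "01/10/2008", "01/10/2008", "14/05/2015", "11/04/2015",
            "05/10/2008"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(
    # parse "dd/mm/yyyy" strings with base R instead of lubridate::dmy
    across(starts_with("Date"), ~ as.Date(.x, format = "%d/%m/%Y")),
    # units = "days" fixes the unit instead of relying on "auto"
    diff_days = abs(as.numeric(difftime(Date1, Date2, units = "days")))
  ) %>%
  group_by(ID) %>%
  # with_ties = FALSE returns a single row per group even if distances tie
  slice_min(diff_days, n = 1, with_ties = FALSE) %>%
  ungroup()

res
```

On this data there are no ties, so the selected rows are the same as with filter(); the difference only matters when two Date2 values are equally far from Date1.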

Try the base R code below

unique(
  subset(
    df,
    # ave() returns 1/0 per row for "is the minimum within its ID"; !! turns that into TRUE/FALSE
    !!ave(
      abs(as.integer(as.Date(Date2, format = "%d/%m/%Y") - as.Date(Date1, format = "%d/%m/%Y"))),
      ID,
      FUN = function(x) x == min(x)
    )
  )
)

which gives

   ID      Date1      Date2
5   5 22/07/2014 22/07/2014
15 14 08/11/2016 14/05/2015

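Using base R: convert the 'Date' columns to Date class, order the rows by 'ID' and by the absolute difference between the two date columns, then subset with duplicated on the 'ID' column to keep the first row for each unique ID.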

df[2:3] <- lapply(df[2:3], as.Date, format = "%d/%m/%Y")
df1 <- df[with(df, order(ID, abs(as.numeric(Date1) - as.numeric(Date2)))),]
df1[!duplicated(df1$ID),]

-output

   ID      Date1      Date2
15 14 2016-11-08 2015-05-14
5   5 2014-07-22 2014-07-22