Eliminate duplicates based on 2 dates in a dataframe
I have this sample data frame:
df <- data.frame(
  ID = c("5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5",
         "14", "14", "14", "14", "14", "14"),
  Date1 = c("22/07/2014", "22/07/2014", "22/07/2014", "22/07/2014",
            "22/07/2014", "22/07/2014", "22/07/2014", "22/07/2014",
            "22/07/2014", "22/07/2014", "22/07/2014",
            "08/11/2016", "08/11/2016", "08/11/2016",
            "08/11/2016", "08/11/2016", "08/11/2016"),
  Date2 = c("01/01/2011", "01/08/2011", "01/12/2010", "10/11/2015",
            "22/07/2014", "01/01/2013", "23/04/2014", "01/01/2006",
            "01/01/2013", "01/10/2012", "01/08/2012",
            "14/04/2015", "01/10/2008", "01/10/2008",
            "14/05/2015", "11/04/2015", "05/10/2008"),
  stringsAsFactors = FALSE
)
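Each ID is repeated several times. I need a data frame with only 1 row per ID. As you can see, each patient has only one date in the df$Date1 column, so the condition for selecting the single row for each patient is: choose the row whose Date2 is nearest to Date1.
How can I do this?
Thanks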
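Here is a tidyverse approach. I create a column called diff_date, which is the absolute difference between Date1 and Date2. Then I filter each ID down to the row with the smallest difference.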
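library(dplyr)
library(lubridate)
df %>%
  mutate(
    # Parse both date columns from day/month/year strings
    across(.cols = starts_with("Date"), .fns = dmy),
    # difftime() picks its unit automatically; the smallest gap here is 0,
    # so the values come back in seconds
    diff_date = abs(as.numeric(difftime(Date1, Date2)))
  ) %>%
  group_by(ID) %>%
  # Keep the row(s) with the smallest gap within each ID
  filter(diff_date == min(diff_date))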
# A tibble: 2 x 4
# Groups:   ID [2]
  ID    Date1      Date2      diff_date
  <chr> <date>     <date>         <dbl>
1 5     2014-07-22 2014-07-22         0
2 14    2016-11-08 2015-05-14  47001600
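If exactly one row per ID is required even when two rows tie for the smallest gap, a variant built on slice_min() may be worth considering. The sketch below is an alternative under stated assumptions, not the answer's original code: it runs on a shortened copy of the sample data (df_small is a name made up for this illustration), parses the dates with base as.Date() instead of lubridate, and asks difftime() for days explicitly so the unit does not depend on the data.

```r
library(dplyr)

# Shortened copy of the question's sample data
# (df_small is a hypothetical name used only in this sketch)
df_small <- data.frame(
  ID    = c("5", "5", "5", "14", "14", "14"),
  Date1 = c("22/07/2014", "22/07/2014", "22/07/2014",
            "08/11/2016", "08/11/2016", "08/11/2016"),
  Date2 = c("01/01/2011", "22/07/2014", "23/04/2014",
            "14/04/2015", "14/05/2015", "05/10/2008"),
  stringsAsFactors = FALSE
)

result <- df_small %>%
  mutate(
    across(starts_with("Date"), ~ as.Date(.x, format = "%d/%m/%Y")),
    # Request days explicitly instead of relying on units = "auto"
    diff_days = abs(as.numeric(difftime(Date1, Date2, units = "days")))
  ) %>%
  group_by(ID) %>%
  # with_ties = FALSE keeps a single row per ID even if two rows share the minimum
  slice_min(diff_days, n = 1, with_ties = FALSE) %>%
  ungroup()

print(result)
```

With this subset the gap for ID 14 comes out as 544 days, which is the same interval as the 47001600 seconds shown above.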
Try the base R code below
unique(
  subset(
    df,
    !!ave(
      abs(as.integer(as.Date(Date2, format = "%d/%m/%Y") - as.Date(Date1, format = "%d/%m/%Y"))),
      ID,
      FUN = function(x) x == min(x)
    )
  )
)
and you will get
   ID      Date1      Date2
5   5 22/07/2014 22/07/2014
15 14 08/11/2016 14/05/2015
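To see why the !!ave(...) filter works, it can help to run ave() on a toy vector. ave() applies FUN within each group and returns a vector as long as its input; because the input is numeric, the logical result of x == min(x) is stored as 0/1, and !! turns it back into TRUE/FALSE for subset(). The vectors gap and id below are made-up illustration values, not the question's data.

```r
# Made-up gaps (in days) and group labels, for illustration only
gap <- c(30, 0, 12, 574, 544, 577)
id  <- c("5", "5", "5", "14", "14", "14")

# Per group: 1 where the gap equals the group minimum, 0 elsewhere
flag <- ave(gap, id, FUN = function(x) x == min(x))

# Double negation converts the 0/1 numbers into a logical mask
keep <- !!flag

print(flag)
print(keep)
```

Note that this mask keeps every row that ties for the minimum, so unique() only collapses ties when the tied rows are identical in all columns.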
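Using base R, convert the 'Date' columns to Date class, order the rows by 'ID' and by the abs difference between the date columns, then subset with duplicated to keep the first row for each unique 'ID'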
df[2:3] <- lapply(df[2:3], as.Date, format = "%d/%m/%Y")
df1 <- df[with(df, order(ID, abs(as.numeric(Date1) - as.numeric(Date2)))),]
df1[!duplicated(df1$ID),]
-output
   ID      Date1      Date2
15 14 2016-11-08 2015-05-14
5   5 2014-07-22 2014-07-22
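The code above overwrites the date columns of df and, because ID is a character column, sorts "14" before "5". If neither side effect is wanted, the same order/duplicated idea could be wrapped in a small function. This is only a sketch: closest_row_per_id(), its fmt argument, and sample_df are hypothetical names introduced here, and the column names Date1, Date2 and ID are assumed to match the question.

```r
# closest_row_per_id() is a hypothetical helper used only for this sketch
closest_row_per_id <- function(data, fmt = "%d/%m/%Y") {
  d1 <- as.Date(data$Date1, format = fmt)
  d2 <- as.Date(data$Date2, format = fmt)
  # Subtracting two Date vectors yields a difference in days
  gap <- abs(as.numeric(d1 - d2))
  # Sort row indices by ID, then by gap; the first index per ID is the closest row
  idx <- order(data$ID, gap)
  first <- idx[!duplicated(data$ID[idx])]
  # sort() puts the selected rows back in their original order
  data[sort(first), , drop = FALSE]
}

# Small subset of the question's data (sample_df is a made-up name)
sample_df <- data.frame(
  ID    = c("5", "5", "14", "14"),
  Date1 = c("22/07/2014", "22/07/2014", "08/11/2016", "08/11/2016"),
  Date2 = c("01/01/2011", "22/07/2014", "14/04/2015", "14/05/2015"),
  stringsAsFactors = FALSE
)

res <- closest_row_per_id(sample_df)
print(res)
```

The input data frame is left untouched, so the date columns in the result are still the original strings.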