How to compare two rows of dates in R and transform into one row

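I have a dataset that contains several rows of date information (intervals) for the same name. These rows should be compared and eventually transformed into one row. I would like to achieve the following:

Data: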

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)

Desired result:

names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "16-4-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)

Convert dates:

df1$date1 <- as.Date(df1$date1, "%d-%m-%Y")
df1$date2 <- as.Date(df1$date2, "%d-%m-%Y")

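Here is one approach using dplyr (probably not the most concise one). First we convert the dates to Date format, then for each name:

  1. Determine whether the second interval starts more than 60 days after the first interval ends. If so, we flag both rows as keep_both. We have sorted by date, so we know the second row comes later.
  2. For rows that are not flagged keep_both, take the minimum and maximum dates. Note that I assume the intervals are correctly ordered, i.e. date2 is later than date1 in every row here.
  3. Filter the data to keep only the first row for each name, unless we are keeping both.

Apart from the typo for Rick, the output matches your desired output.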

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names, date1, date2)

library(tidyverse)

df1 %>%
  mutate(across(c(date1, date2), lubridate::dmy)) %>%
  arrange(names, date1, date2) %>%
  group_by(names) %>%
  mutate(
    keep_both = any((date1 - lag(date2)) > 60, na.rm = TRUE),
    new_date1 = if_else(keep_both, date1, min(date1)),
    new_date2 = if_else(keep_both, date2, max(date2)),
  ) %>%
  filter(keep_both | row_number() == 1) %>%
  select(names, date1 = new_date1, date2 = new_date2)
#> # A tibble: 5 x 3
#> # Groups:   names [4]
#>   names date1      date2     
#>   <chr> <date>     <date>    
#> 1 Harry 2018-08-27 2019-06-27
#> 2 Harry 2020-02-04 2020-04-08
#> 3 John  2016-03-01 2020-04-16
#> 4 Katie 2019-01-05 2020-04-10
#> 5 Rick  2018-01-13 2020-03-02

Created on 2020-07-13 by the reprex package (v0.3.0)
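
To make the 60-day check from step 1 easier to follow, here is a minimal base R sketch for a single name (Harry's two rows from the sample data). The variable name gap_days is only illustrative and is not part of the answer's code; the shifted vector plays the role of lag(date2).

```r
# Harry's two intervals from the sample data, already sorted by start date
date1 <- as.Date(c("27-8-2018", "4-2-2020"), format = "%d-%m-%Y")
date2 <- as.Date(c("27-6-2019", "8-4-2020"), format = "%d-%m-%Y")

# Days between each start date and the previous interval's end date
# (NA for the first row, which has no predecessor)
gap_days <- as.numeric(date1 - c(as.Date(NA), date2[-length(date2)]))

# TRUE when any gap exceeds 60 days, in which case both rows are kept
keep_both <- any(gap_days > 60, na.rm = TRUE)
```

For Harry the second interval starts well over 60 days after the first one ends, so keep_both is TRUE and both rows survive the filter.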

I used slightly altered sample data, to make sure that intervals with a gap <= 60 days are joined, as described in the question.

Sample data

names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "28-4-2020", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "28-5-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)

  names     date1     date2
1  John  1-3-2016 16-4-2020
2  John 28-4-2020 28-5-2020 # !! <-- altered so interval-gap with line 1 <= 60 days
3  Rick 13-1-2018  2-3-2020
4  Rick  4-2-2020 16-2-2020
5 Katie  5-1-2019 25-2-2020
6 Katie 29-1-2020 10-4-2020
7 Harry 27-8-2018 27-6-2019
8 Harry  4-2-2020  8-4-2020

names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("28-5-2020", "2-3-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)

  names     date1     date2
1  John  1-3-2016 28-5-2020  # !! <-- joined, since gap <= 60 days
2  Rick 13-1-2018  2-3-2020  # !! <-- fixed typo in your sample data provided
3 Katie  5-1-2019 10-4-2020
4 Harry 27-8-2018 27-6-2019
5 Harry  4-2-2020  8-4-2020

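Code

The code uses the data.table and intervals packages. Since intervals only works with numeric (integer or real) intervals, the date columns are converted to numeric before the intervals are created, extended and merged, and converted back to Date format after processing.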
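
As a small illustration of that conversion, the sketch below round-trips a single date through its numeric form in base R. The variable names d, n and back are only illustrative.

```r
# A date from the sample data, parsed from its dd-mm-yyyy string
d <- as.Date("16-4-2020", format = "%d-%m-%Y")

# Numeric form: the number of days since 1970-01-01
n <- as.numeric(d)

# Back to Date, using the same origin as the code in this answer
back <- as.Date(n, origin = "1970-01-01")
```

Because the numeric value is a whole number of days, adding or subtracting 30 from it shifts the date by exactly 30 days, which is what the interval extension relies on.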

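What the code below does:

  1. Loop (lapply()) over chunks of the split data (using data.table's split() method with the by argument), by name. keep.by = FALSE is used because we do not need that column, and the names are also stored in the names of the created list.
  2. For each chunk (= name), define intervals from the two date columns and extend these intervals by half the value of the gap variable (set to 60 in the code below). So every interval is extended by 30 days at the start and at the end.
    Then join the overlapping/touching (extended) intervals, and finally remove the extension.
  3. Bind all results back together using data.table::rbindlist().
  4. Set the column names, and convert the numeric dates back to Date format.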

library( data.table )
library( intervals )
#set maximum gap between intervals
gap = 60
#set data to data.table format
setDT(df1)
#set dates (dd-mm-yyyy strings, or already Date) to numeric (required by the intervals-package)
df1[, c("date1", "date2") := lapply( .SD, function(d) as.numeric( as.Date( d, format = "%d-%m-%Y" ) ) ), .SDcols = c("date1", "date2") ]
#where the magic happens (see text above for explanation )
ans <- data.table::rbindlist(
  lapply( split( df1 , by = "names", keep.by = FALSE ), function(x) {
    as.data.table(
    intervals::close_intervals( intervals::contract( intervals::reduce( intervals::expand( 
            intervals::Intervals( x, type = "Z" ), 
            gap/2 ) ), gap/2 ) 
      )
    )
  }),
  use.names = TRUE, idcol = "name" )
#use names from df1
setnames( ans, names(ans), names(df1) )
#set numeric back to date
ans[, c("date1", "date2") := lapply( .SD, as.Date, origin = "1970-01-01" ), .SDcols = c("date1", "date2") ]

Output

   names      date1      date2
1:  John 2016-03-01 2020-05-28
2:  Rick 2018-01-13 2020-03-02
3: Katie 2019-01-05 2020-04-10
4: Harry 2018-08-27 2019-06-27
5: Harry 2020-02-04 2020-04-08
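
To see the effect of the extend / join / remove-extension steps for one name without the intervals package, here is a rough base R sketch. merge_with_gap is a hypothetical helper, not a function from data.table or intervals. It assumes each interval's end is not before its start, and it merges when the gap is at most 60 days, so its handling of the exact boundary may differ slightly from the intervals-based code.

```r
# Hypothetical helper (not part of the intervals package): merge intervals
# whose gap is at most `gap` days. `start` and `end` are Date vectors of
# equal length, with end >= start in every position (an assumption).
merge_with_gap <- function(start, end, gap = 60) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # Latest end date seen so far, as a number of days
  run_max <- cummax(as.numeric(end))
  # Latest end date among the intervals before each row (NA for the first)
  prev_end <- c(NA, run_max[-length(run_max)])
  # A new merged interval starts when the gap to all earlier rows exceeds `gap`
  new_group <- is.na(prev_end) | (as.numeric(start) - prev_end) > gap
  first <- which(new_group)
  last <- c(first[-1] - 1, length(start))
  data.frame(
    date1 = start[first],
    date2 = as.Date(run_max[last], origin = "1970-01-01")
  )
}

# John's (altered) rows: the gap is 12 days, so they are joined
john <- merge_with_gap(
  as.Date(c("2016-03-01", "2020-04-28")),
  as.Date(c("2020-04-16", "2020-05-28"))
)

# Harry's rows: the gap is far larger than 60 days, so both are kept
harry <- merge_with_gap(
  as.Date(c("2018-08-27", "2020-02-04")),
  as.Date(c("2019-06-27", "2020-04-08"))
)
```

Applied per name, this gives the same rows as the output above for the sample data: one joined row for John and two separate rows for Harry.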