How to compare two rows of dates in R and transform into one row
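I have a data set that contains multiple rows of date information (intervals) for the same name. These rows should be compared and, where possible, transformed into one row. I would like to achieve the following:
- If the intervals overlap, keep one row with the earliest and the latest of the four dates
- If the intervals do not overlap but the gap between them is 60 days or less, do the same as above: keep one row with the earliest and the latest of the four dates
- If the intervals do not overlap and the gap between them is more than 60 days, do nothing (keep both rows)
Data: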
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)
Desired result:
names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "16-4-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)
Converting the dates:
df1$date1 <- as.Date(df1$date1, "%d-%m-%Y")
df1$date2 <- as.Date(df1$date2, "%d-%m-%Y")
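Here is one approach using dplyr (probably not the most concise). First we convert the dates to Date format, then for each name:
- Determine whether the second interval starts more than 60 days after the first interval ends. If so, we flag both rows as keep_both. The rows are sorted by date, so we know the second row comes later.
- For rows that are not flagged keep_both, take the minimum and maximum dates. Note that I am assuming the intervals are correctly ordered, i.e. date2 is later than date1 in every row.
- Filter the data to keep only the first row per name, unless we are keeping both.
Apart from the typo for Rick, the output matches your desired output.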
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names, date1, date2)
library(tidyverse)
df1 %>%
  mutate(across(c(date1, date2), lubridate::dmy)) %>%
  arrange(names, date1, date2) %>%
  group_by(names) %>%
  mutate(
    keep_both = any((date1 - lag(date2)) > 60, na.rm = TRUE),
    new_date1 = if_else(keep_both, date1, min(date1)),
    new_date2 = if_else(keep_both, date2, max(date2)),
  ) %>%
  filter(keep_both | row_number() == 1) %>%
  select(names, date1 = new_date1, date2 = new_date2)
#> # A tibble: 5 x 3
#> # Groups:   names [4]
#>   names date1      date2
#>   <chr> <date>     <date>
#> 1 Harry 2018-08-27 2019-06-27
#> 2 Harry 2020-02-04 2020-04-08
#> 3 John  2016-03-01 2020-04-16
#> 4 Katie 2019-01-05 2020-04-10
#> 5 Rick  2018-01-13 2020-03-02
Created on 2020-07-13 by the reprex package (v0.3.0)
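The pipeline above relies on there being exactly two rows per name. As a rough sketch of how the same rule could be extended to any number of intervals per name, here is a base R version. The helper `merge_intervals()` and its `max_gap` argument are hypothetical names made up for this illustration and are not part of the answer above. It assumes date2 is never earlier than date1 within a row.

```r
# Sample data from the question, with the dates parsed as Date
df1 <- data.frame(
  names = c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry"),
  date1 = as.Date(c("1-3-2016", "18-5-2016", "13-1-2018", "4-2-2020",
                    "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020"),
                  format = "%d-%m-%Y"),
  date2 = as.Date(c("16-4-2020", "13-2-2020", "2-3-2020", "16-2-2020",
                    "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020"),
                  format = "%d-%m-%Y")
)

# Hypothetical helper (an assumption for illustration): merges all intervals
# of a name that overlap or lie at most `max_gap` days apart
merge_intervals <- function(df, max_gap = 60) {
  df <- df[order(df$names, df$date1, df$date2), ]
  pieces <- lapply(split(df, df$names), function(x) {
    start <- as.numeric(x$date1)
    end   <- as.numeric(x$date2)
    # latest end date seen so far within this name
    running_end <- cummax(end)
    # gap between each start and the latest end of all earlier rows
    gap <- start[-1] - running_end[-length(end)]
    # a new block begins whenever that gap exceeds max_gap
    block <- cumsum(c(TRUE, gap > max_gap))
    data.frame(
      names = as.character(x$names[1]),
      date1 = as.Date(as.vector(tapply(start, block, min)), origin = "1970-01-01"),
      date2 = as.Date(as.vector(tapply(end, block, max)), origin = "1970-01-01")
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

res <- merge_intervals(df1)
print(res)
```

On the question's data this should give the same five rows as the dplyr pipeline. The running maximum of the end dates is what lets an interval nested inside an earlier, longer one (as with John) be absorbed correctly.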
I used slightly altered sample data, to make sure that intervals with a gap <= 60 days are joined as described in the question.
Sample data
names <- c("John", "John", "Rick", "Rick", "Katie", "Katie", "Harry", "Harry" )
date1 <- c("1-3-2016", "28-4-2020", "13-1-2018", "4-2-2020", "5-1-2019", "29-1-2020", "27-8-2018", "4-2-2020")
date2 <- c("16-4-2020", "28-5-2020", "2-3-2020", "16-2-2020", "25-2-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df1 <- data.frame(names,date1,date2)
  names     date1     date2
1  John  1-3-2016 16-4-2020
2  John 28-4-2020 28-5-2020   # !! <-- altered so interval-gap with line 1 <= 60 days
3  Rick 13-1-2018  2-3-2020
4  Rick  4-2-2020 16-2-2020
5 Katie  5-1-2019 25-2-2020
6 Katie 29-1-2020 10-4-2020
7 Harry 27-8-2018 27-6-2019
8 Harry  4-2-2020  8-4-2020
names <- c("John", "Rick", "Katie", "Harry", "Harry")
date1 <- c("1-3-2016", "13-1-2018", "5-1-2019", "27-8-2018", "4-2-2020")
date2 <- c("28-5-2020", "2-3-2020", "10-4-2020", "27-6-2019", "8-4-2020")
df2 <- data.frame(names,date1,date2)
  names     date1     date2
1  John  1-3-2016 28-5-2020   # !! <-- joined, since gap <= 60 days
2  Rick 13-1-2018  2-3-2020   # !! <-- fixed typo in the sample data you provided
3 Katie  5-1-2019 10-4-2020
4 Harry 27-8-2018 27-6-2019
5 Harry  4-2-2020  8-4-2020
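Code
The code uses the data.table and intervals packages. Because intervals only works with numeric (integer or real) intervals, the date columns are converted to numeric before the interval creation/extension/merging, and converted back to dates after processing.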
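As a small illustration of that round trip (a base R sketch, not part of the original answer): a Date is stored as the number of days since 1970-01-01, so as.numeric() and as.Date() with that origin undo each other.

```r
# Parse one of the sample dates
d <- as.Date("16-4-2020", format = "%d-%m-%Y")

# Plain day count since 1970-01-01; this is what the intervals package works on
n <- as.numeric(d)

# Convert the day count back to a calendar date
back <- as.Date(n, origin = "1970-01-01")

# The round trip returns the date we started with
same <- d == back
print(same)
```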
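What the code below does:
- Loop (lapply()) over chunks of df1, split by name using the data.table method of split() with the by argument. keep.by = FALSE is used because we do not need the names column inside each chunk; the names are also stored as the names of the resulting list.
- For each chunk (= name), define the intervals based on the two date columns, and expand each interval by half the value of the gap variable (set to 60 in the code below), so every interval is extended by 30 days at the front and at the back. Then join the overlapping/touching (expanded) intervals, and finally remove the expansion again.
- Use data.table::rbindlist() to bind all results back together.
- Set the column names, and convert the numeric dates back to Date format.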
library( data.table )
library( intervals )
#set maximum gap between intervals
gap <- 60
#set data to data.table format
setDT(df1)
#make sure the dates are of class Date (the sample data above stores them as text)
df1[, c("date1", "date2") := lapply( .SD, as.Date, format = "%d-%m-%Y" ), .SDcols = c("date1", "date2") ]
#set dates to numeric (required by the intervals-package)
df1[, c("date1", "date2") := lapply( .SD, as.numeric ), .SDcols = c("date1", "date2") ]
#split by name, then per name: expand, merge, contract (see text above for explanation)
ans <- data.table::rbindlist(
  lapply( split( df1 , by = "names", keep.by = FALSE ), function(x) {
    as.data.table(
      intervals::close_intervals( intervals::contract( intervals::reduce( intervals::expand(
        intervals::Intervals( x, type = "Z" ),
        gap/2 ) ), gap/2 )
      )
    )
  }),
  use.names = TRUE, idcol = "name" )
#use names from df1
setnames( ans, names(ans), names(df1) )
#set numeric back to date
ans[, c("date1", "date2") := lapply( .SD, as.Date, origin = "1970-01-01" ), .SDcols = c("date1", "date2") ]
Output
   names      date1      date2
1:  John 2016-03-01 2020-05-28
2:  Rick 2018-01-13 2020-03-02
3: Katie 2019-01-05 2020-04-10
4: Harry 2018-08-27 2019-06-27
5: Harry 2020-02-04 2020-04-08
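The expand, merge, contract idea can also be checked by hand with plain date arithmetic. The following is only a sketch in base R and does not use the intervals package. The helper `touches_after_expand()` is a hypothetical name made up for this illustration. It treats a gap of exactly `gap` days as still mergeable, which is an assumption about the edge case rather than a statement about how the package behaves.

```r
gap <- 60

# Hypothetical helper (an assumption for illustration): TRUE when two intervals
# overlap or touch once each is widened by gap/2 days on both sides
touches_after_expand <- function(end1, start2, gap) {
  (end1 + gap / 2) >= (start2 - gap / 2)
}

# John (altered sample data): first interval ends 16-4-2020, second starts 28-4-2020
john <- touches_after_expand(as.Date("2020-04-16"), as.Date("2020-04-28"), gap)

# Harry: first interval ends 27-6-2019, second starts 4-2-2020
harry <- touches_after_expand(as.Date("2019-06-27"), as.Date("2020-02-04"), gap)

john   # TRUE: the widened intervals overlap, so the rows are merged
harry  # FALSE: still apart after widening, so both rows stay
```

Widening both intervals by half the allowed gap turns "at most 60 days apart" into an ordinary overlap test, which is why the merged result only needs the extra 30 days trimmed off again at each end.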