Identifying overlapping date intervals on grouped data
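I have a large dataset in which I want to identify observations that overlap in both time and space. Each observation has a unique ID, and I have already identified the ones that overlap in space; they are listed in the overlap_id column. Now I want to check whether the start/end points of the spatially overlapping observations also overlap in time.
A simple example is given below: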
start <- c("2007-06-27", "2010-06-30", "2015-01-01", "2012-01-01", "2010-01-01", "2009-01-01")
end <- c("2008-10-01", "2010-07-01", "2017-02-02", "2013-01-01", "2010-07-03", "2012-01-01")
df <- data.frame(id = 1:6,
                 start = as.Date(start, format = "%Y-%m-%d"),
                 end = as.Date(end, format = "%Y-%m-%d"),
                 overlap_id = c("2, 4", "1, 3, 5", "2, 5", "1, 5, 6", "2, 3, 4", "4"))
> df
  id      start        end overlap_id
1  1 2007-06-27 2008-10-01       2, 4
2  2 2010-06-30 2010-07-01    1, 3, 5
3  3 2015-01-01 2017-02-02       2, 5
4  4 2012-01-01 2013-01-01    1, 5, 6
5  5 2010-01-01 2010-07-03    2, 3, 4
6  6 2009-01-01 2012-01-01          4
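Not all IDs that overlap in space also overlap in time. How can I identify the ones that do? In other words, I need to match on overlap_id (which could be turned into a longer format with tidyr::separate_rows(overlap_id)) as well as on the start/end dates. I tried using lubridate::interval, but I could not make sure that the overlap check is restricted to the IDs listed in overlap_id.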
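To make the temporal condition concrete, here is a minimal base R sketch of the pairwise test. It assumes closed intervals, so two observations that merely share an endpoint date count as overlapping; intervals_overlap() is a hypothetical helper name introduced for illustration, not a lubridate function.

```r
# Two closed date intervals overlap when each one starts
# on or before the day the other one ends.
# intervals_overlap() is a hypothetical helper, not part of any package.
intervals_overlap <- function(start_a, end_a, start_b, end_b) {
  start_a <= end_b & start_b <= end_a
}

# ids 2 and 5 from the example: 2010-06-30..2010-07-01 vs 2010-01-01..2010-07-03
pair_2_5 <- intervals_overlap(as.Date("2010-06-30"), as.Date("2010-07-01"),
                              as.Date("2010-01-01"), as.Date("2010-07-03"))

# ids 1 and 2 from the example: 2007-06-27..2008-10-01 vs 2010-06-30..2010-07-01
pair_1_2 <- intervals_overlap(as.Date("2007-06-27"), as.Date("2008-10-01"),
                              as.Date("2010-06-30"), as.Date("2010-07-01"))

pair_2_5  # TRUE
pair_1_2  # FALSE
```

Because the comparison is vectorised, the same function can be applied to whole columns once each row has been paired with the dates of one candidate ID.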
This is my desired output:
> df
  id      start        end overlap_id time_overlap overlap_dummy
1  1 2007-06-27 2008-10-01       2, 4           NA             0
2  2 2010-06-30 2010-07-01    1, 3, 5            5             1
3  3 2015-01-01 2017-02-02       2, 5           NA             0
4  4 2012-01-01 2013-01-01    1, 5, 6            6             1
5  5 2010-01-01 2010-07-03    2, 3, 4            2             1
6  6 2009-01-01 2012-01-01          4            6             1
Any help would be greatly appreciated! Thanks.
Here is one way to do it...
library(lubridate)
library(tidyverse)
df2 <- df %>%
  separate_rows(overlap_id, convert = TRUE) %>%     # one row per candidate id
  left_join(df %>% select(-overlap_id) %>%          # add dates for the candidate rows
              rename(start1 = start,                # avoid name clash on join
                     end1 = end),
            by = c("overlap_id" = "id")) %>%
  filter(int_overlaps(interval(start, end),         # drop pairs that do not overlap in time
                      interval(start1, end1))) %>%
  group_by(id) %>%
  summarise(time_overlap = paste(overlap_id, collapse = ", ")) %>%  # collapse to one string if more than 1
  right_join(df, by = "id") %>%                     # merge back to df
  arrange(id)                                       # restore the original row order
df2
# A tibble: 6 x 5
     id time_overlap start      end        overlap_id
  <int> <chr>        <date>     <date>     <fct>
1     1 NA           2007-06-27 2008-10-01 2, 4
2     2 5            2010-06-30 2010-07-01 1, 3, 5
3     3 NA           2015-01-01 2017-02-02 2, 5
4     4 6            2012-01-01 2013-01-01 1, 5, 6
5     5 2            2010-01-01 2010-07-03 2, 3, 4
6     6 4            2009-01-01 2012-01-01 4
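The output above does not yet include the overlap_dummy column from the question. The following is a sketch of one possible variant, not the answerer's code: it swaps int_overlaps() for a plain date comparison (assuming shared endpoints should count as overlapping), collapses multiple matches into one string, and derives the dummy from whether any match was found. The names dates, other_id, other_start, other_end, time_overlaps and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data so the sketch is self-contained
df <- data.frame(
  id = 1:6,
  start = as.Date(c("2007-06-27", "2010-06-30", "2015-01-01",
                    "2012-01-01", "2010-01-01", "2009-01-01")),
  end = as.Date(c("2008-10-01", "2010-07-01", "2017-02-02",
                  "2013-01-01", "2010-07-03", "2012-01-01")),
  overlap_id = c("2, 4", "1, 3, 5", "2, 5", "1, 5, 6", "2, 3, 4", "4"),
  stringsAsFactors = FALSE
)

# Lookup table with each candidate's dates under non-clashing names
dates <- df %>% select(other_id = id, other_start = start, other_end = end)

time_overlaps <- df %>%
  separate_rows(overlap_id, sep = ",\\s*", convert = TRUE) %>%  # one row per candidate id
  inner_join(dates, by = c("overlap_id" = "other_id")) %>%      # attach the candidate's dates
  filter(start <= other_end, other_start <= end) %>%            # keep pairs that overlap in time
  group_by(id) %>%
  summarise(time_overlap = paste(overlap_id, collapse = ", "))  # one string per id

# Left join keeps every row of df in its original order
result <- df %>%
  left_join(time_overlaps, by = "id") %>%
  mutate(overlap_dummy = as.integer(!is.na(time_overlap)))      # 1 if any temporal overlap, else 0

result
```

Starting from df and using left_join() keeps the original row order, so no extra sorting step is needed. On large data the join on candidate IDs keeps the comparison limited to the spatially overlapping pairs rather than all pairs.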