Identifying overlapping date intervals on grouped data

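I have a large dataset in which I want to identify observations that overlap in both time and space. Each observation has a unique ID, and I have already identified the ones that overlap in space; these are given by overlap_id. Now I want to check whether the observations that overlap in space also have overlapping start/end points.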

A simple example is given below:

start <- c("2007-06-27", "2010-06-30", "2015-01-01", "2012-01-01", "2010-01-01", "2009-01-01")
end <- c("2008-10-01", "2010-07-01", "2017-02-02", "2013-01-01", "2010-07-03", "2012-01-01")
df <- data.frame(id = c(1:6),
                 start = as.Date(start, format = "%Y-%m-%d"),
                 end = as.Date(end, format = "%Y-%m-%d"),
                 overlap_id = as.character(c("2, 4", "1, 3, 5", "2, 5", "1, 5, 6", "2, 3, 4", "4")))

> df
  id      start        end overlap_id
1  1 2007-06-27 2008-10-01       2, 4
2  2 2010-06-30 2010-07-01    1, 3, 5
3  3 2015-01-01 2017-02-02       2, 5
4  4 2012-01-01 2013-01-01    1, 5, 6
5  5 2010-01-01 2010-07-03    2, 3, 4
6  6 2009-01-01 2012-01-01          4

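Not all IDs that overlap in space also overlap in time. How can I identify the ones that do? In other words, I need to match on overlap_id (which can be turned into a longer format with tidyr::separate_rows(overlap_id)) as well as on the start/end dates. I tried using lubridate::interval, but I could not make sure that the overlap check was restricted to the overlaps identified in overlap_id.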

This is my desired output:

> df
  id      start        end overlap_id  time_overlap overlap_dummy
1  1 2007-06-27 2008-10-01       2, 4           NA             0 
2  2 2010-06-30 2010-07-01    1, 3, 5            5             1
3  3 2015-01-01 2017-02-02       2, 5           NA             0
4  4 2012-01-01 2013-01-01    1, 5, 6            6             1
5  5 2010-01-01 2010-07-03    2, 3, 4            2             1
6  6 2009-01-01 2012-01-01          4            4             1

Any help would be greatly appreciated! Thanks.

Here is one way to do it...

library(lubridate)
library(tidyverse)

df2 <- df %>% separate_rows(overlap_id, convert = TRUE) %>%   #one row per (id, overlap_id) pair
  left_join(df %>% select(-overlap_id) %>%                    #add dates for overlap rows
                   rename(start1 = start,                     #avoid name clash on join
                          end1 = end),
            by = c("overlap_id" = "id")) %>% 
  filter(int_overlaps(interval(start, end),                   #drop pairs that do not overlap in time
                      interval(start1, end1))) %>% 
  group_by(id) %>% 
  summarise(time_overlap = paste(overlap_id, collapse = ", ")) %>% #paste overlaps if more than 1
  right_join(df, by = "id") %>%                               #merge back to df
  arrange(id)                                                 #keep rows in id order

df2
# A tibble: 6 x 5
     id time_overlap start      end        overlap_id
  <int> <chr>        <date>     <date>     <fct>     
1     1 NA           2007-06-27 2008-10-01 2, 4      
2     2 5            2010-06-30 2010-07-01 1, 3, 5   
3     3 NA           2015-01-01 2017-02-02 2, 5      
4     4 6            2012-01-01 2013-01-01 1, 5, 6   
5     5 2            2010-01-01 2010-07-03 2, 3, 4   
6     6 4            2009-01-01 2012-01-01 4
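
The question also asked for an overlap_dummy column, which the pipeline above does not create. As a cross-check, here is a minimal base R sketch that derives both time_overlap and overlap_dummy without any packages. It assumes closed intervals, so two ranges that merely share an endpoint (such as ids 4 and 6) count as overlapping, which is what the output above suggests int_overlaps does on this data. The helper name time_overlaps is made up for illustration and is not part of any library.

```r
# Rebuild the example data from the question
start <- c("2007-06-27", "2010-06-30", "2015-01-01", "2012-01-01", "2010-01-01", "2009-01-01")
end   <- c("2008-10-01", "2010-07-01", "2017-02-02", "2013-01-01", "2010-07-03", "2012-01-01")
df <- data.frame(id = 1:6,
                 start = as.Date(start),
                 end = as.Date(end),
                 overlap_id = c("2, 4", "1, 3, 5", "2, 5", "1, 5, 6", "2, 3, 4", "4"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: for row i, return the spatial candidates that also overlap in time
time_overlaps <- function(i) {
  cand <- as.integer(strsplit(df$overlap_id[i], ",\\s*")[[1]])  # candidate ids for row i
  j <- match(cand, df$id)                                       # row positions of the candidates
  # Assumption: closed intervals, so a shared endpoint counts as an overlap
  hit <- df$start[i] <= df$end[j] & df$start[j] <= df$end[i]
  if (any(hit)) paste(cand[hit], collapse = ", ") else NA_character_
}

df$time_overlap  <- vapply(seq_len(nrow(df)), time_overlaps, character(1))
df$overlap_dummy <- as.integer(!is.na(df$time_overlap))         # 1 if any temporal overlap, else 0
df
```

This loops over rows, so on a very large dataset the join-based approach above is likely to scale better; the sketch is mainly useful for checking the result on a small sample.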