Matching Columns with Overlapping Intervals (lubridate)

Question

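I have two data frames with different numbers of rows and columns. Each one has a date interval column; df has an additional column holding some attribute. My goal is to pull information from df (which has the attribute) into df2 under certain conditions. The procedure should be as follows:

For each date interval in df2, check whether any interval in df overlaps it. If so, create a column in df2 that lists the attributes of the overlapping df intervals. Several attributes can match a single df2 interval.

I created the following example data: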

library(lubridate)
date1 <- as.Date(c('2017-11-1','2017-11-1','2017-11-4'))
date2 <- as.Date(c('2017-11-5','2017-11-3','2017-11-5'))
df <- data.frame(matrix(NA,nrow=3, ncol = 4)) 
names(df) <- c("Begin_A", "End_A", "Interval", "Attribute")
df$Begin_A <-date1
df$End_A <-date2

df$Interval <-df$Begin_A %--% df$End_A
df$Attribute<- as.character(c("Attr1","Attr2","Attr3"))

### Second df:

date1 <- as.Date(c('2017-11-2','2017-11-5','2017-11-7','2017-11-1'))
date2 <- as.Date(c('2017-11-3','2017-11-6','2017-11-8','2017-11-1'))
df2 <- data.frame(matrix(NA,nrow=4, ncol = 3)) 
names(df2) <- c("Begin_A", "End_A", "Interval")
df2$Begin_A <-date1
df2$End_A <-date2
df2$Interval <-df2$Begin_A %--% df2$End_A

This produces these data frames:

df:

Begin_A      End_A        Interval                         Attribute
2017-11-01   2017-11-05   2017-11-01 UTC--2017-11-05 UTC   Attr1
2017-11-01   2017-11-03   2017-11-01 UTC--2017-11-03 UTC   Attr2
2017-11-04   2017-11-05   2017-11-04 UTC--2017-11-05 UTC   Attr3

df2:

Begin_A      End_A        Interval
2017-11-02   2017-11-03   2017-11-02 UTC--2017-11-03 UTC
2017-11-05   2017-11-06   2017-11-05 UTC--2017-11-06 UTC
2017-11-07   2017-11-08   2017-11-07 UTC--2017-11-08 UTC
2017-11-01   2017-11-01   2017-11-01 UTC--2017-11-01 UTC

The data frame I want looks like this:

Begin_A      End_A        Interval                         Matched_Attr 
2017-11-02   2017-11-03   2017-11-02 UTC--2017-11-03 UTC   Attr1;Attr2
2017-11-05   2017-11-06   2017-11-05 UTC--2017-11-06 UTC   Attr1;Attr3
2017-11-07   2017-11-08   2017-11-07 UTC--2017-11-08 UTC   NA
2017-11-01   2017-11-01   2017-11-01 UTC--2017-11-01 UTC   Attr1;Attr2

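I have already looked at the int_overlaps() function, but could not get the "scanning through all intervals of another column" part to work. If this is possible, is there a solution that makes use of the tidyr environment?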

Answer 1

Using the lubridate package from the tidyverse and its int_overlaps() function, you can write a simple for loop that goes through the individual values of df2$Interval, like this:

df2$Matched_Attr <- NA
for (i in seq_len(nrow(df2))) {
  # Collect the attributes of every df interval that overlaps the i-th df2 interval
  df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse = ", ")
}

This gives the following result:

#     Begin_A      End_A                       Interval Matched_Attr
#1 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC Attr1, Attr3
#3 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC             
#4 2017-11-01 2017-11-01 2017-11-01 UTC--2017-11-01 UTC Attr1, Attr2
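The loop works because int_overlaps() recycles a single interval against a whole vector of intervals and returns one logical value per element. A minimal sketch of that behaviour, using the second df2 interval and the three df intervals from the question (the variable names one, many and overlaps are only illustrative):

```r
library(lubridate)

# A single interval (second row of df2) ...
one <- as.Date("2017-11-05") %--% as.Date("2017-11-06")

# ... compared against a vector of three intervals (the rows of df)
many <- as.Date(c("2017-11-01", "2017-11-01", "2017-11-04")) %--%
  as.Date(c("2017-11-05", "2017-11-03", "2017-11-05"))

# One logical value per element of `many`; shared endpoints count as overlap
overlaps <- int_overlaps(one, many)
print(overlaps)
```

The resulting logical vector is what the loop uses to index df$Attribute.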

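I left the NA handling open, but the additional line df2$Matched_Attr[df2$Matched_Attr==""]<-NA returns exactly the expected result. (Use collapse=";" in paste() if you want the semicolon separator shown in the question.)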
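The loop and the NA replacement can also be folded into a single step. The following is only a sketch on the question's example data: match_attributes() is a hypothetical helper name, not a lubridate function, and the ";" separator is taken from the desired output in the question.

```r
library(lubridate)

# Example data from the question
df <- data.frame(
  Begin_A = as.Date(c("2017-11-01", "2017-11-01", "2017-11-04")),
  End_A = as.Date(c("2017-11-05", "2017-11-03", "2017-11-05")),
  Attribute = c("Attr1", "Attr2", "Attr3"),
  stringsAsFactors = FALSE
)
df$Interval <- df$Begin_A %--% df$End_A

df2 <- data.frame(
  Begin_A = as.Date(c("2017-11-02", "2017-11-05", "2017-11-07", "2017-11-01")),
  End_A = as.Date(c("2017-11-03", "2017-11-06", "2017-11-08", "2017-11-01"))
)
df2$Interval <- df2$Begin_A %--% df2$End_A

# Hypothetical helper: attributes of all `lookup` rows whose interval
# overlaps `interval`, or NA when nothing overlaps
match_attributes <- function(interval, lookup, sep = ";") {
  hits <- lookup$Attribute[int_overlaps(interval, lookup$Interval)]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = sep)
}

# vapply() guarantees exactly one character value per row of df2
df2$Matched_Attr <- vapply(
  seq_len(nrow(df2)),
  function(i) match_attributes(df2$Interval[i], df),
  character(1)
)
print(df2)
```

Returning NA_character_ directly avoids the intermediate empty string, and vapply() fails loudly if the helper ever returns something other than a single string.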

In response to your comment (perform the above only when the condition df$ID[i]==df2$ID[i] holds), the implementation looks like this:

library(lubridate)
#df
df <- data.frame(Attribute=c("Attr1","Attr2","Attr3"),
                 ID = c(3,2,1),
                 Begin_A=as.Date(c('2017-11-1','2017-11-1','2017-11-4')),
                 End_A=as.Date(c('2017-11-5','2017-11-3','2017-11-5')))
df$Interval <- df$Begin_A %--% df$End_A

### Second df:
df2 <- data.frame(ID=c(3,4,5),
                  Begin_A=as.Date(c('2017-11-2','2017-11-5','2017-11-7')),
                  End_A=as.Date(c('2017-11-3','2017-11-6','2017-11-8')))
df2$Interval <- df2$Begin_A %--% df2$End_A

df2$Matched_Attr <- NA
for (i in seq_len(nrow(df2))) {
  # IDs are compared by row position, so row i of df2 is checked against row i of df
  if (df2$ID[i] == df$ID[i]) {
    df2$Matched_Attr[i] <- paste(df$Attribute[int_overlaps(df2$Interval[i], df$Interval)], collapse = ", ")
  }
}
print(df2)
#  ID    Begin_A      End_A                       Interval Matched_Attr
#1  3 2017-11-02 2017-11-03 2017-11-02 UTC--2017-11-03 UTC Attr1, Attr2
#2  4 2017-11-05 2017-11-06 2017-11-05 UTC--2017-11-06 UTC         <NA>
#3  5 2017-11-07 2017-11-08 2017-11-07 UTC--2017-11-08 UTC         <NA>
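The loop above compares IDs by row position and, once they agree, still collects attributes from every overlapping df row. If the intent was instead to keep only df rows that carry the same ID value as the df2 row (this is an assumption about the comment, not something stated in the thread), a sketch could combine both conditions in one logical index:

```r
library(lubridate)

# Data from the answer above
df <- data.frame(
  Attribute = c("Attr1", "Attr2", "Attr3"),
  ID = c(3, 2, 1),
  Begin_A = as.Date(c("2017-11-01", "2017-11-01", "2017-11-04")),
  End_A = as.Date(c("2017-11-05", "2017-11-03", "2017-11-05")),
  stringsAsFactors = FALSE
)
df$Interval <- df$Begin_A %--% df$End_A

df2 <- data.frame(
  ID = c(3, 4, 5),
  Begin_A = as.Date(c("2017-11-02", "2017-11-05", "2017-11-07")),
  End_A = as.Date(c("2017-11-03", "2017-11-06", "2017-11-08"))
)
df2$Interval <- df2$Begin_A %--% df2$End_A

df2$Matched_Attr <- vapply(seq_len(nrow(df2)), function(i) {
  # Keep df rows that share the ID value AND overlap the i-th df2 interval
  keep <- df$ID == df2$ID[i] & int_overlaps(df2$Interval[i], df$Interval)
  hits <- df$Attribute[keep]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
}, character(1))
print(df2)
```

With this variant the first row matches only Attr1, because Attr2 overlaps but belongs to a different ID. It also does not require df and df2 to have the same number of rows.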

Tags: r, lubridate, tidyr