Efficient implementation of a numbered window variable around given events

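I have a continuous dataset and a vector of "event" dates. I want to create a numbered window variable covering a window of a given length before and after each event. I have code that works, but it is painfully slow, and I would like to know the best way to make it more efficient.

My code is below. I also have a function, create_date_vector, that keeps only dates far enough apart that their windows do not overlap. It is there mostly to make the example below run (but improvements to it are obviously welcome too).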

data <- data.frame(day = seq(as.Date("2000-01-01"), as.Date("2001-01-01"), by = "day"))

# sort the sampled dates so that diff() below measures gaps between neighbouring events
dates <- sort(sample(seq(as.Date("2000-01-01"), as.Date("2001-01-01"), by = "day"), 30))

pre <- 3
post <- 3

create_date_vector <- function(dates, pre, post){
  
  t_dates_dif <- diff(dates)
  selected_dates <- c()
  
  for(i in 1:(length(t_dates_dif) - 1)){
    selected_dates <- c(selected_dates, (t_dates_dif[i] > pre + post) + (t_dates_dif[i+1] > pre + post))
  }
  return(dates[which(selected_dates == 2) + 1])
}

dates_chosen <- sort(create_date_vector(dates, pre, post))

What really needs optimizing is the following code, which creates the windows:
library(dplyr)  # needed for %>% and mutate()
data$event <- NA
for(i in 1:length(dates_chosen)){
  data <- data %>%
    mutate(
      event = ifelse(day >= dates_chosen[i] - pre & day <= dates_chosen[i] + post, i, event)
    )
}

Thanks for your help.

It may be easier with lead:
library(dplyr)
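create_date_vector2 <- function(dates, pre, post) {
  t1 <- diff(dates)          # gaps between consecutive dates
  pre_post <- pre + post     # minimum gap required on each side
  # keep a date only if both the gap before it and the gap after it exceed pre_post
  dates[which(((t1 > pre_post) + (dplyr::lead(t1) > pre_post)) == 2) + 1]
}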

-testing

> create_date_vector2(dates, 3, 3)
[1] "2011-06-17" "2008-07-30" "2002-02-19"

-output from the OP's function

> create_date_vector(dates, pre, post)
[1] "2011-06-17" "2008-07-30" "2002-02-19"
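
To check the equivalence on reproducible data, a minimal sketch follows. It assumes sorted input dates and a fixed seed (neither is in the original post), and `create_date_vector_base` is a hypothetical name for a variant that swaps `dplyr::lead` for plain base R indexing, so the check runs without any packages.

```r
# Reproducible, sorted sample of event dates (the seed and the sort are assumptions).
set.seed(42)
all_days <- seq(as.Date("2000-01-01"), as.Date("2001-01-01"), by = "day")
dates <- sort(sample(all_days, 30))
pre <- 3
post <- 3

# Loop version from the question, unchanged in logic.
create_date_vector <- function(dates, pre, post) {
  t_dates_dif <- diff(dates)
  selected_dates <- c()
  for (i in 1:(length(t_dates_dif) - 1)) {
    selected_dates <- c(
      selected_dates,
      (t_dates_dif[i] > pre + post) + (t_dates_dif[i + 1] > pre + post)
    )
  }
  dates[which(selected_dates == 2) + 1]
}

# Hypothetical base R variant: c(gaps[-1], NA) shifts the gaps forward by one,
# which mirrors what dplyr::lead() does with its default arguments.
create_date_vector_base <- function(dates, pre, post) {
  gaps <- as.numeric(diff(dates))   # gaps in days between consecutive dates
  next_gaps <- c(gaps[-1], NA)      # the gap that follows each gap
  width <- pre + post
  dates[which(gaps > width & next_gaps > width) + 1]
}

# TRUE when both versions select exactly the same dates.
same <- identical(
  create_date_vector(dates, pre, post),
  create_date_vector_base(dates, pre, post)
)
print(same)
```

The vectorised form avoids growing `selected_dates` one element at a time, which is where the loop spends most of its effort on longer inputs.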

The windows around the event dates can be created by an update in a non-equi join using a helper table:
library(data.table)
# create helper table
events <- data.table(dates_chosen)[
  , `:=`(rn = .I, from = dates_chosen - pre, to = dates_chosen + post)]
# update in a non-equi join 
setDT(data)[events, on = .(day >= from, day <= to), event := rn][]
            day event
  1: 2000-01-01    NA
  2: 2000-01-02    NA
  3: 2000-01-03    NA
  4: 2000-01-04    NA
  5: 2000-01-05    NA
 ---                 
363: 2000-12-28    NA
364: 2000-12-29    NA
365: 2000-12-30    NA
366: 2000-12-31    NA
367: 2001-01-01    NA
# show only updated rows
data[!is.na(event)]
           day event
 1: 2000-05-16     1
 2: 2000-05-17     1
 3: 2000-05-18     1
 4: 2000-05-19     1
 5: 2000-05-20     1
 6: 2000-05-21     1
 7: 2000-05-22     1
 8: 2000-06-17     2
 9: 2000-06-18     2
10: 2000-06-19     2
11: 2000-06-20     2
12: 2000-06-21     2
13: 2000-06-22     2
14: 2000-06-23     2
15: 2000-10-26     3
16: 2000-10-27     3
17: 2000-10-28     3
18: 2000-10-29     3
19: 2000-10-30     3
20: 2000-10-31     3
21: 2000-11-01     3
           day event

The helper table is

events[]
   dates_chosen rn       from         to
1:   2000-05-19  1 2000-05-16 2000-05-22
2:   2000-06-20  2 2000-06-17 2000-06-23
3:   2000-10-29  3 2000-10-26 2000-11-01
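
If adding data.table is not an option, the same lookup can be sketched in base R with `findInterval`. This is an alternative technique, not part of the answer above, and it assumes that `dates_chosen` is sorted and that the windows do not overlap (which `create_date_vector` is meant to guarantee). The three event dates are copied from the helper table.

```r
pre <- 3
post <- 3
data <- data.frame(day = seq(as.Date("2000-01-01"), as.Date("2001-01-01"), by = "day"))

# Sorted event dates with non-overlapping windows (values from the helper table).
dates_chosen <- as.Date(c("2000-05-19", "2000-06-20", "2000-10-29"))

from <- dates_chosen - pre   # window starts
to <- dates_chosen + post    # window ends

# For each day, the index of the last window whose start is <= that day (0 if none).
idx <- findInterval(as.numeric(data$day), as.numeric(from))
idx[idx == 0] <- NA

# Keep the index only when the day also falls on or before that window's end.
data$event <- ifelse(data$day <= to[idx], idx, NA_integer_)

# Show only the rows that fall inside a window.
print(data[!is.na(data$event), ])
```

Because `findInterval` does a binary search over the sorted window starts, every day is assigned in a single vectorised pass instead of one full `mutate` per event.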