Efficient implementation of a numbered window variable around given events
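I have a continuous dataset and a vector of "event" dates. I want to create a numbered window variable that marks a window of a given length before and after each event. I have code that works, but it is ridiculously slow, and I would like to know the best way to make it more efficient.
My code is below. I also have a function, create_date_vector, which keeps only dates that are far enough apart that the windows do not overlap. It is mostly there to make the example below run (but improvements to it are obviously welcome too).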
data <- data.frame(day = seq(as.Date("2000-01-01"), as.Date("2001-01-01"), by = "day"))
dates <- sample(seq(as.Date("2000-01-01"), as.Date("2001-01-01"), by = "day"), 30)
pre <- 3
post <- 3
create_date_vector <- function(dates, pre, post){
  t_dates_dif <- diff(dates)
  selected_dates <- c()
  for(i in 1:(length(t_dates_dif) - 1)){
    selected_dates <- c(selected_dates, (t_dates_dif[i] > pre + post) + (t_dates_dif[i+1] > pre + post))
  }
  return(dates[which(selected_dates == 2) + 1])
}
dates_chosen <- sort(create_date_vector(dates, pre, post))
What really needs optimizing is the following code, which creates the windows:
library(dplyr)  # needed for %>% and mutate()
data$event <- NA
for(i in 1:length(dates_chosen)){
  data <- data %>%
    mutate(
      event = ifelse(day >= dates_chosen[i] - pre & day <= dates_chosen[i] + post, i, event)
    )
}
Thanks for your help.
It may be easier to use lead:
library(dplyr)
create_date_vector2 <- function(dates, pre, post) {
  t1 <- diff(dates)
  pre_post <- pre + post
  dates[which(((t1 > pre_post) + (dplyr::lead(t1) > pre_post)) == 2) + 1]
}
-testing
> create_date_vector2(dates, 3, 3)
[1] "2011-06-17" "2008-07-30" "2002-02-19"
-output of the OP's function
> create_date_vector(dates, pre, post)
[1] "2011-06-17" "2008-07-30" "2002-02-19"
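To make the comparison reproducible, here is a minimal sketch on a small, sorted vector of dates. The vector toy_dates and its day offsets are made up for illustration and are not from the question. It suggests that both functions keep a date only when the gap to the previous date and the gap to the next date both exceed pre + post, so the first and last dates are never kept.

```r
library(dplyr)

# The OP's loop-based function, copied from the question
create_date_vector <- function(dates, pre, post){
  t_dates_dif <- diff(dates)
  selected_dates <- c()
  for(i in 1:(length(t_dates_dif) - 1)){
    selected_dates <- c(selected_dates, (t_dates_dif[i] > pre + post) + (t_dates_dif[i+1] > pre + post))
  }
  return(dates[which(selected_dates == 2) + 1])
}

# The vectorised version using lead()
create_date_vector2 <- function(dates, pre, post) {
  t1 <- diff(dates)
  pre_post <- pre + post
  dates[which(((t1 > pre_post) + (dplyr::lead(t1) > pre_post)) == 2) + 1]
}

# Hypothetical, already sorted dates: the gaps are 10, 10, 2, 18 and 15 days
toy_dates <- as.Date("2000-01-01") + c(0, 10, 20, 22, 40, 55)

res_loop <- create_date_vector(toy_dates, 3, 3)
res_lead <- create_date_vector2(toy_dates, 3, 3)

res_lead
# [1] "2000-01-11" "2000-02-10"
identical(res_loop, res_lead)
# [1] TRUE
```

Only the dates at offsets 10 and 40 have a gap larger than 6 days on both sides, which is why those two are returned.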
The windows around the event dates can be created by updating in a non-equi join using a helper table:
library(data.table)
# create helper table
events <- data.table(dates_chosen)[
, `:=`(rn = .I, from = dates_chosen - pre, to = dates_chosen + post)]
# update in a non-equi join
setDT(data)[events, on = .(day >= from, day <= to), event := rn][]
day event
1: 2000-01-01 NA
2: 2000-01-02 NA
3: 2000-01-03 NA
4: 2000-01-04 NA
5: 2000-01-05 NA
---
363: 2000-12-28 NA
364: 2000-12-29 NA
365: 2000-12-30 NA
366: 2000-12-31 NA
367: 2001-01-01 NA
# show only updated rows
data[!is.na(event)]
day event
1: 2000-05-16 1
2: 2000-05-17 1
3: 2000-05-18 1
4: 2000-05-19 1
5: 2000-05-20 1
6: 2000-05-21 1
7: 2000-05-22 1
8: 2000-06-17 2
9: 2000-06-18 2
10: 2000-06-19 2
11: 2000-06-20 2
12: 2000-06-21 2
13: 2000-06-22 2
14: 2000-06-23 2
15: 2000-10-26 3
16: 2000-10-27 3
17: 2000-10-28 3
18: 2000-10-29 3
19: 2000-10-30 3
20: 2000-10-31 3
21: 2000-11-01 3
day event
The helper table is
events[]
dates_chosen rn from to
1: 2000-05-19 1 2000-05-16 2000-05-22
2: 2000-06-20 2 2000-06-17 2000-06-23
3: 2000-10-29 3 2000-10-26 2000-11-01
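Because the example above depends on a random sample, here is a small deterministic sketch of the same update in a non-equi join. The ten-day data set, the two event dates and the window of one day on each side are assumptions chosen only to keep the output short.

```r
library(data.table)

pre <- 1
post <- 1

# Hypothetical toy data: ten consecutive days and two well-separated events
data <- data.table(day = seq(as.Date("2000-01-01"), by = "day", length.out = 10))
dates_chosen <- as.Date(c("2000-01-03", "2000-01-08"))

# Helper table: one row per event with its number and window bounds
events <- data.table(dates_chosen)[
  , `:=`(rn = .I, from = dates_chosen - pre, to = dates_chosen + post)]

# Every day with from <= day <= to receives the event number, by reference
data[events, on = .(day >= from, day <= to), event := rn]

data$event
# [1] NA  1  1  1 NA NA  2  2  2 NA
```

The join replaces the loop over events: each row of events is matched against all days in its window in a single call, and data is updated in place rather than being copied once per event.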