Join two data tables to override values by date range
I want to correct one table based on overrides in another table. When dt_override contains the unit and its date range covers a date in dt_current, I want to change the value in dt_current.
dt_current <- data.table( unit = c(rep("a",10), rep("b", 10)),
date = seq(as.Date("2015-1-1"), by = "day", length.out = 10),
num = 1:10, key = "unit")
dt_override <- data.table( unit = c("a", "a", "b", "zed" ), start_date = as.Date(c("2015-01-03", "1492-12-25", "2015-01-02", "2015-01-11")),
end_date = as.Date(c("2015-01-05", "1492-12-26", "2015-01-04", "2015-01-14")),
value = NA, key = "unit")
It seems I should use some form of .EACHI when joining the two data.tables. I wrote the code below, but of course it does not work.
dt_current[dt_override,
num := if(i.start_date <= date & i.end_date >= date) i.value,
by = .EACHI]
Here is one approach that enumerates the date sequences:
dt_override[, value := as.integer(value)]
# It's necessary to convert to integer because `NA` is logical unless otherwise specified.
# Expand each override row into one row per day, using one group per override row.
dto = dt_override[, .(
  unit,
  date = seq.Date(start_date, end_date, by = "day"),
  value
), by = .(row_id = seq_len(nrow(dt_override)))][, row_id := NULL]
setkey(dt_current, unit, date)
dt_current[dto, num := i.value]
Now that foverlaps exists, there may be a better way.
With foverlaps, you can do the following:
dt_current[, date2 := date] # define end date
setkey(dt_current, unit, date, date2) # key by unit, start and end dates
setkey(dt_override, unit, start_date, end_date) # same
First option: create an index and update by reference
indx <- foverlaps(dt_override, dt_current, which = TRUE) # run foverlaps and get indices
dt_current[indx$yid, num := dt_override[indx$xid, value]] # adjust by reference
Alternatively, you can run foverlaps the other way around, which avoids creating indx but creates an entirely new data set:
foverlaps(dt_current, dt_override)[!is.na(start_date), num := value
][, .SD, .SDcols = names(dt_current)]
Another option, using a rolling join:
setkey(dt_current, unit, date)
setkey(dt_override, unit, start_date)
dt_current[, num := dt_override[dt_current, roll = TRUE][end_date >= start_date,
                                num := value]$num]
# another version of the above, but using ifelse (unclear to me which one is faster)
dt_current[, num := dt_override[dt_current,
                                ifelse(end_date >= start_date, value, num), roll = TRUE]]
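The answers above predate non-equi joins. As a further option that is not part of the original answers, a minimal sketch of the same update with a non-equi join in `on` might look like the following. It assumes a data.table version that supports non-equi joins. The names `cur` and `ovr` are illustrative copies of the question's tables, built here so the block runs on its own.

```r
library(data.table)

# Illustrative copies of the question's tables (names `cur` and `ovr` are assumptions).
cur <- data.table(
  unit = rep(c("a", "b"), each = 10),
  date = rep(seq(as.Date("2015-01-01"), by = "day", length.out = 10), times = 2),
  num  = rep(1:10, times = 2)
)
ovr <- data.table(
  unit       = c("a", "a", "b", "zed"),
  start_date = as.Date(c("2015-01-03", "1492-12-25", "2015-01-02", "2015-01-11")),
  end_date   = as.Date(c("2015-01-05", "1492-12-26", "2015-01-04", "2015-01-14")),
  value      = NA_integer_  # integer NA so the type matches `num`
)

# Non-equi join: match on unit and on start_date <= date <= end_date,
# then overwrite `num` by reference in the matching rows of `cur` only.
cur[ovr, on = .(unit, date >= start_date, date <= end_date), num := i.value]
```

No keys or helper columns are needed with this form. Override rows with no matching unit or date (the 1492 range and unit "zed" here) leave `cur` untouched.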