Join by overlapping periods while operating on some of the values

Question

I am trying to join a table of periods like this one:

library(data.table)

id = c(rep(1,3), rep(2,3), rep(3,3))
start = as.Date(c("2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13"))
end = as.Date(c("2015-03-11", "2015-08-12", "2018-12-31", "2015-03-11", "2015-08-12", "2018-12-31","2015-03-11", "2015-08-12", "2018-12-31"))

DT = data.table(id, start, end)

DT

   id      start        end
1:  1 2014-07-01 2015-03-11
2:  1 2015-03-12 2015-08-12
3:  1 2016-08-13 2018-12-31
4:  2 2014-07-01 2015-03-11
5:  2 2015-03-12 2015-08-12
6:  2 2016-08-13 2018-12-31
7:  3 2014-07-01 2015-03-11
8:  3 2015-03-12 2015-08-12
9:  3 2016-08-13 2018-12-31

with a table of clinical records (weight and height) for the same people, like this one:

id_clin = (c(rep(1,2), rep (2,3), rep(3,4)))
date = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01", "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01", "2018-11-30"))
weight = c(60, 65, 62, 75, 68, 90 , 102, 104 , 98 )
height = c(160, 160, 170, 175, 170, 200, 200, 200 ,200)

DT_clin = data.table(id_clin, date, weight, height)

DT_clin

   id_clin       date weight height
1:       1 2014-10-23     60    160
2:       1 2016-09-01     65    160
3:       2 2017-01-01     62    170
4:       2 2014-08-01     75    175
5:       2 2015-02-01     68    170
6:       3 2017-06-01     90    200
7:       3 2018-03-05    102    200
8:       3 2018-09-01    104    200
9:       3 2018-11-30     98    200

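When a clinical measurement (DT_clin) for an id falls between the start and end of a period (DT) for the same id, the measurement values must be joined to that period.
If DT_clin has no value within a DT period, nothing needs to be joined.
If several values fall within a DT period, I want to compute the mean of the overlapping values.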

The desired result looks like this*:

   id      start        end       date       date2       weight       height
1:  1 2014-07-01 2015-03-11 2014-10-23  2014-10-23         60.0        160.0
2:  1 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
3:  1 2016-08-13 2018-12-31 2016-09-01  2016-09-01         65.0        160.0
4:  2 2014-07-01 2015-03-11 2014-08-01  2015-02-01         71.5        172.5
5:  2 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
6:  2 2016-08-13 2018-12-31 2017-01-01  2017-01-01         62.0        170.0
7:  3 2014-07-01 2015-03-11       <NA>        <NA>           NA           NA
8:  3 2015-03-12 2015-08-12       <NA>        <NA>           NA           NA
9:  3 2016-08-13 2018-12-31 2017-06-01  2018-11-30         98.5        200.0

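Also, if there is a way to apply several operations to different variables, I would be interested in learning it (for example, computing the mean of weight and the maximum of height while doing the join).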

I tested foverlaps and it works well when only one value overlaps, but I cannot reach my goal when several values overlap:

DT_clin[, date2 := date]   # copy date so each measurement becomes a zero-length interval*
setkey(DT, id, start, end)
setkey(DT_clin, id_clin, date, date2)

foverlaps(DT[id == 1, ], DT_clin[id_clin == 1, ], by.x = c("id", "start", "end"), by.y = c("id_clin", "date", "date2"), nomatch = NA)

Should I use a non-equi join?

Thanks in advance for your help :)

*I duplicated date to create date2 and fake a time interval

Answer 1

Use a non-equi join, then aggregate by id, start and end:

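# Non-equi join: keep every period of DT and match the measurements dated within [start, end].
# In the result, `date` holds start and `date.1` holds end; `date2` keeps the measurement date.
ans <- DT_clin[DT, on = .(date >= start, date <= end, id_clin = id)]
# Collapse multiple matches per period into a single row
ans[, .(date   = min(date2),
        date2  = max(date2),
        weight = mean(weight),
        height = mean(height)),
    by = .(id = id_clin, start = date, end = date.1)]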

#    id      start        end       date      date2 weight height
# 1:  1 2014-07-01 2015-03-11 2014-10-23 2014-10-23   60.0  160.0
# 2:  1 2015-03-12 2015-08-12       <NA>       <NA>     NA     NA
# 3:  1 2016-08-13 2018-12-31 2016-09-01 2016-09-01   65.0  160.0
# 4:  2 2014-07-01 2015-03-11 2014-08-01 2015-02-01   71.5  172.5
# 5:  2 2015-03-12 2015-08-12       <NA>       <NA>     NA     NA
# 6:  2 2016-08-13 2018-12-31 2017-01-01 2017-01-01   62.0  170.0
# 7:  3 2014-07-01 2015-03-11       <NA>       <NA>     NA     NA
# 8:  3 2015-03-12 2015-08-12       <NA>       <NA>     NA     NA
# 9:  3 2016-08-13 2018-12-31 2017-06-01 2018-11-30   98.5  200.0
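
The question also asks about applying a different operation to each variable (mean of weight, maximum of height). A minimal sketch of one way to do that with the same non-equi join is below. It rebuilds the question's data so it runs on its own; the output column names (`n_records`, `first_date`, `last_date`, `weight_mean`, `height_max`) are made up for illustration, and `date2` is assumed to be a plain copy of `date`, as in the question's footnote.

```r
library(data.table)

# Data copied from the question
DT <- data.table(
  id    = rep(c(1, 2, 3), each = 3),
  start = rep(as.Date(c("2014-07-01", "2015-03-12", "2016-08-13")), 3),
  end   = rep(as.Date(c("2015-03-11", "2015-08-12", "2018-12-31")), 3)
)
DT_clin <- data.table(
  id_clin = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
  date    = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01",
                      "2014-08-01", "2015-02-01", "2017-06-01",
                      "2018-03-05", "2018-09-01", "2018-11-30")),
  weight  = c(60, 65, 62, 75, 68, 90, 102, 104, 98),
  height  = c(160, 160, 170, 175, 170, 200, 200, 200, 200)
)

# Keep a copy of the measurement date: the join overwrites `date`
# with the period bounds (assumption taken from the question's footnote)
DT_clin[, date2 := date]

# Non-equi join: one row per (period, matching measurement),
# plus one all-NA row for each period without a measurement
joined <- DT_clin[DT, on = .(date >= start, date <= end, id_clin == id)]

# A different summary for each variable, one row per period
res <- joined[, .(n_records   = sum(!is.na(weight)),
                  first_date  = min(date2),
                  last_date   = max(date2),
                  weight_mean = mean(weight),
                  height_max  = max(height)),
              by = .(id = id_clin, start = date, end = date.1)]
print(res)
```

For id 2's first period, which contains two measurements, this gives a mean weight of 71.5 and a maximum height of 175. Periods without any measurement keep NA in the summary columns and 0 in `n_records`.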

Answer 2

With foverlaps:

library(data.table)
setkey(DT_clin, id_clin, date, date2)

foverlaps(DT, DT_clin, 
          by.x =c("id", "start", "end"), 
          by.y = c("id_clin", "date", "date2" ), nomatch = NA )[
          ,.(datemin = min(date),
             datemax = max(date),
             weight  = mean(weight, na.rm = TRUE),
             height  = mean(height, na.rm = TRUE)),
           by=.(id,start,end)]

   id      start        end    datemin    datemax weight height
1:  1 2014-07-01 2015-03-11 2014-10-23 2014-10-23   60.0  160.0
2:  1 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
3:  1 2016-08-13 2018-12-31 2016-09-01 2016-09-01   65.0  160.0
4:  2 2014-07-01 2015-03-11 2014-08-01 2015-02-01   71.5  172.5
5:  2 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
6:  2 2016-08-13 2018-12-31 2017-01-01 2017-01-01   62.0  170.0
7:  3 2014-07-01 2015-03-11       <NA>       <NA>    NaN    NaN
8:  3 2015-03-12 2015-08-12       <NA>       <NA>    NaN    NaN
9:  3 2016-08-13 2018-12-31 2017-06-01 2018-11-30   98.5  200.0
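
The output above shows NaN where the non-equi join answer shows NA. That comes from `na.rm = TRUE`: once the only (missing) value is dropped, nothing is left to average. A small base R sketch of the difference follows; `mean_or_na` is a hypothetical helper written for this example, not a data.table function.

```r
# mean() of a group whose only value is NA
mean(NA_real_)                # NA: the missing value propagates
mean(NA_real_, na.rm = TRUE)  # NaN: no values remain after dropping NA

# Hypothetical helper: ignore NAs, but return NA (not NaN)
# when every value in the group is missing
mean_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)
}

mean_or_na(c(NA, NA))      # NA
mean_or_na(c(75, 68, NA))  # 71.5
```

Using such a helper in place of `mean(..., na.rm = TRUE)` in the aggregation step would likely give NA for the periods without measurements, matching the desired result in the question.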

Tags: time, join, r, data.table, non-equi-join