Join by overlapping periods while aggregating some of the values
I am trying to join a table of periods like this:
library(data.table)
id = c(rep(1,3), rep(2,3), rep(3,3))
start = as.Date(c("2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13", "2014-07-01", "2015-03-12", "2016-08-13"))
end = as.Date(c("2015-03-11", "2015-08-12", "2018-12-31", "2015-03-11", "2015-08-12", "2018-12-31","2015-03-11", "2015-08-12", "2018-12-31"))
DT = data.table(id, start, end)
DT
id start end
1: 1 2014-07-01 2015-03-11
2: 1 2015-03-12 2015-08-12
3: 1 2016-08-13 2018-12-31
4: 2 2014-07-01 2015-03-11
5: 2 2015-03-12 2015-08-12
6: 2 2016-08-13 2018-12-31
7: 3 2014-07-01 2015-03-11
8: 3 2015-03-12 2015-08-12
9: 3 2016-08-13 2018-12-31
with a table of clinical records (weight and height) for the same people, like this:
id_clin = (c(rep(1,2), rep (2,3), rep(3,4)))
date = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01", "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01", "2018-11-30"))
weight = c(60, 65, 62, 75, 68, 90 , 102, 104 , 98 )
height = c(160, 160, 170, 175, 170, 200, 200, 200 ,200)
DT_clin = data.table(id_clin, date, weight, height)
DT_clin
id_clin date weight height
1: 1 2014-10-23 60 160
2: 1 2016-09-01 65 160
3: 2 2017-01-01 62 170
4: 2 2014-08-01 75 175
5: 2 2015-02-01 68 170
6: 3 2017-06-01 90 200
7: 3 2018-03-05 102 200
8: 3 2018-09-01 104 200
9: 3 2018-11-30 98 200
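- When a clinical measurement (DT_clin) for an id falls between the start and end of a period (DT) for the same id, the measured values must be joined to that period.
- If DT_clin has no value within a DT period, nothing needs to be joined.
- If several values fall within a DT period, I want the mean of the overlapping values.
The desired result looks like this*: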
id start end date date2 weight height
1: 1 2014-07-01 2015-03-11 2014-10-23 2014-10-23 60.0 160.0
2: 1 2015-03-12 2015-08-12 <NA> <NA> NA NA
3: 1 2016-08-13 2018-12-31 2016-09-01 2016-09-01 65.0 160.0
4: 2 2014-07-01 2015-03-11 2014-08-01 2015-02-01 71.5 172.5
5: 2 2015-03-12 2015-08-12 <NA> <NA> NA NA
6: 2 2016-08-13 2018-12-31 2017-01-01 2017-01-01 62.0 170.0
7: 3 2014-07-01 2015-03-11 <NA> <NA> NA NA
8: 3 2015-03-12 2015-08-12 <NA> <NA> NA NA
9: 3 2016-08-13 2018-12-31 2018-03-05 2018-11-30 101.3 200.0
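In addition, if there is a way to apply a different operation to each variable, I would be interested in that too (for example, computing the mean of weight and the maximum of height while doing the join).
I tested foverlaps and it works well when there is only one value, but I cannot reach my goal when several values overlap: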
DT_clin[, date2 := date]  # copy date so that foverlaps has an interval (see footnote)
setkey(DT, id, start, end)
setkey(DT_clin, id_clin, date, date2)
foverlaps(DT[id == 1, ], DT_clin[id_clin == 1, ], by.x = c("id", "start", "end"), by.y = c("id_clin", "date", "date2"), nomatch = NA)
Should I use a non-equi join?
Thanks in advance for your help :)
*I copied date to create date2 and fake a time interval.
Use a non-equi join, then aggregate by id, start and end:
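# date2 is the copy of date described in the question's footnote
# after the join, date and date.1 hold the period bounds (start and end) from DT
ans <- DT_clin[DT, on = .(date >= start, date <= end, id_clin = id)]
ans[, .(date = min(date2),
        date2 = max(date2),
        weight = mean(weight),
        height = mean(height)),
    by = .(id = id_clin, start = date, end = date.1)]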
# id start end date date2 weight height
# 1: 1 2014-07-01 2015-03-11 2014-10-23 2014-10-23 60.0 160.0
# 2: 1 2015-03-12 2015-08-12 <NA> <NA> NA NA
# 3: 1 2016-08-13 2018-12-31 2016-09-01 2016-09-01 65.0 160.0
# 4: 2 2014-07-01 2015-03-11 2014-08-01 2015-02-01 71.5 172.5
# 5: 2 2015-03-12 2015-08-12 <NA> <NA> NA NA
# 6: 2 2016-08-13 2018-12-31 2017-01-01 2017-01-01 62.0 170.0
# 7: 3 2014-07-01 2015-03-11 <NA> <NA> NA NA
# 8: 3 2015-03-12 2015-08-12 <NA> <NA> NA NA
# 9: 3 2016-08-13 2018-12-31 2017-06-01 2018-11-30 98.5 200.0
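The question also asks about applying a different operation to each variable. Since the aggregation step is an ordinary grouped `j` expression, each column can get its own function. The sketch below is one possible way to do this, not the answer's own code: it rebuilds the question's data, keeps the original columns apart with data.table's `x.` and `i.` prefixes (so no `date2` copy is needed), and then takes the mean of weight and the maximum of height. The names `joined`, `res`, `weight_mean`, `height_max` and `n_obs` are illustrative assumptions.

```r
library(data.table)

# Data from the question (ids stored as doubles in both tables)
DT <- data.table(
  id    = rep(c(1, 2, 3), each = 3),
  start = rep(as.Date(c("2014-07-01", "2015-03-12", "2016-08-13")), times = 3),
  end   = rep(as.Date(c("2015-03-11", "2015-08-12", "2018-12-31")), times = 3)
)
DT_clin <- data.table(
  id_clin = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
  date    = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01",
                      "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01",
                      "2018-11-30")),
  weight  = c(60, 65, 62, 75, 68, 90, 102, 104, 98),
  height  = c(160, 160, 170, 175, 170, 200, 200, 200, 200)
)

# Non-equi join: one row per (period, matching measurement);
# periods without a measurement are kept with NA values.
# i.* refers to columns of DT, x.* to columns of DT_clin.
joined <- DT_clin[DT,
  on = .(id_clin == id, date >= start, date <= end),
  .(id = i.id, start = i.start, end = i.end,
    date = x.date, weight = x.weight, height = x.height)]

# A different summary per variable, grouped by period
res <- joined[, .(n_obs       = sum(!is.na(weight)),  # measurements in the period
                  weight_mean = mean(weight),
                  height_max  = max(height)),
              by = .(id, start, end)]
res
```

Periods with no measurement keep `NA` in both summary columns because `na.rm` is deliberately not set; with `na.rm = TRUE`, `max()` on an all-`NA` group would return `-Inf` with a warning.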
With foverlaps:
library(data.table)
# date2 is the copy of date described in the question's footnote
setkey(DT_clin, id_clin, date, date2)
foverlaps(DT, DT_clin,
          by.x = c("id", "start", "end"),
          by.y = c("id_clin", "date", "date2"), nomatch = NA)[
  , .(datemin = min(date),
      datemax = max(date),
      weight = mean(weight, na.rm = TRUE),
      height = mean(height, na.rm = TRUE)),
  by = .(id, start, end)]
id start end datemin datemax weight height
1: 1 2014-07-01 2015-03-11 2014-10-23 2014-10-23 60.0 160.0
2: 1 2015-03-12 2015-08-12 <NA> <NA> NaN NaN
3: 1 2016-08-13 2018-12-31 2016-09-01 2016-09-01 65.0 160.0
4: 2 2014-07-01 2015-03-11 2014-08-01 2015-02-01 71.5 172.5
5: 2 2015-03-12 2015-08-12 <NA> <NA> NaN NaN
6: 2 2016-08-13 2018-12-31 2017-01-01 2017-01-01 62.0 170.0
7: 3 2014-07-01 2015-03-11 <NA> <NA> NaN NaN
8: 3 2015-03-12 2015-08-12 <NA> <NA> NaN NaN
9: 3 2016-08-13 2018-12-31 2017-06-01 2018-11-30 98.5 200.0
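Both answers above join first and aggregate afterwards. If the join and the aggregation should happen in a single call, as the question puts it, one option is data.table's `by = .EACHI`, which evaluates `j` once per row of the inner table. This is a sketch under that assumption rather than part of either answer; `res2`, `weight_mean` and `height_max` are illustrative names. In the result the join columns carry DT_clin's names (`id_clin`, `date`, `date`) but hold the values from DT, so they are renamed by position.

```r
library(data.table)

# Data from the question (ids stored as doubles in both tables)
DT <- data.table(
  id    = rep(c(1, 2, 3), each = 3),
  start = rep(as.Date(c("2014-07-01", "2015-03-12", "2016-08-13")), times = 3),
  end   = rep(as.Date(c("2015-03-11", "2015-08-12", "2018-12-31")), times = 3)
)
DT_clin <- data.table(
  id_clin = c(1, 1, 2, 2, 2, 3, 3, 3, 3),
  date    = as.Date(c("2014-10-23", "2016-09-01", "2017-01-01", "2014-08-01",
                      "2015-02-01", "2017-06-01", "2018-03-05", "2018-09-01",
                      "2018-11-30")),
  weight  = c(60, 65, 62, 75, 68, 90, 102, 104, 98),
  height  = c(160, 160, 170, 175, 170, 200, 200, 200, 200)
)

# Join and aggregate in one step: j runs once for each period in DT
res2 <- DT_clin[DT,
  on = .(id_clin == id, date >= start, date <= end),
  .(weight_mean = mean(weight), height_max = max(height)),
  by = .EACHI]

# The first three columns are the join columns, filled with DT's values
setnames(res2, 1:3, c("id", "start", "end"))
res2
```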