How to find whether multiple years fall within patient follow-up in R

Question

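If this question is technically a duplicate, feel free to close it, but I have looked at many similar answers and none of them work for my data.

I have patient follow-up data that looks like this: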

ID   start.date   end.date
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
..
.. 
n    2007-02-19   2010-08-06

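The structure is simple, but I have over 4 million IDs.

I am trying to work out how many IDs were registered in each year from 1990 to 2016, so that I can calculate annual incidence rates (the disease-status column is omitted here). I would like a data set like this: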

ID   start.date   end.date    y1990 ... y1995 ..  y2000 ..  y2005 ..  y2016
1    1999-03-02   2003-06-15    0         0         1          0        0
2    1990-11-23   2007-09-26    1         1         1          1        0
..
.. 
n    2005-02-19   2016-08-06    0         0         0          1        1

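Each year column is 1 if the patient was still "registered" in that year and 0 otherwise.

As a side note, it would be even better if someone knows of an R package that calculates stratified incidence rates, but so far I have not been able to get any of them to do what I want.

I have tried various solutions with data.table, lubridate and dplyr, all to no avail. Any help would be appreciated.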

Answer 1

You could do something like this:

library(tidyverse)
df %>%
    mutate(year = as.numeric(sub("-\\d+-\\d+$", "", start.date))) %>%  # backslashes must be doubled in R strings
    group_by(ID) %>%
    mutate(n = 1:n()) %>%
    spread(year, n, fill = 0)
## A tibble: 3 x 6
## Groups:   ID [3]
#  ID    start.date end.date   `1995` `1999` `2007`
#  <fct> <fct>      <fct>       <dbl>  <dbl>  <dbl>
#1 1     1999-03-02 2003-06-15     0.     1.     0.
#2 2     1995-11-23 2007-09-26     1.     0.     0.
#3 n     2007-02-19 2010-08-06     0.     0.     1.

Sample data

df <- read.table(text =
    "ID   start.date   end.date
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
n    2007-02-19   2010-08-06", header = TRUE)
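
Note that the output above flags only the year in which follow-up starts. As a minimal sketch of one way to flag every year from the start year through the end year, the base R code below compares the two years against a vector of candidate years with `outer()`. The names `years`, `start_year`, `end_year`, `flags` and `res` are illustrative and not part of the original answer, and the 1990 to 2016 range is taken from the question.

```r
# Sample data as in this answer (dates kept as character, not factor)
df <- read.table(text =
    "ID   start.date   end.date
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
n    2007-02-19   2010-08-06", header = TRUE, stringsAsFactors = FALSE)

years <- 1990:2016  # assumed range, taken from the question

# Calendar year of the start and end of follow-up
start_year <- as.integer(format(as.Date(df$start.date), "%Y"))
end_year   <- as.integer(format(as.Date(df$end.date), "%Y"))

# One row per patient, one column per year:
# TRUE when start_year <= year and end_year >= year
flags <- outer(start_year, years, "<=") & outer(end_year, years, ">=")

# Convert to 0/1 and name the columns y1990 ... y2016
flags <- matrix(as.integer(flags), nrow = nrow(df),
                dimnames = list(NULL, paste0("y", years)))

res <- cbind(df, flags)
```

For a few million rows this builds a dense integer matrix with one column per year, so memory use grows with the number of years requested.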

Answer 2

Another option you could try:

library(tidyverse)
library(lubridate)

# tibble() and a formula lambda replace the deprecated data_frame() and funs()
tibble(year = rep(1999:2009, each = nrow(df)), ID = rep(df$ID, 2009 - 1998)) %>%
    left_join(df, ., by = "ID") %>%
    mutate(int = interval(parse_date_time(substring(start.date, 1, 4), orders = "y"), parse_date_time(substring(end.date, 1, 4), orders = "y"))) %>%
    mutate(val = ifelse(parse_date_time(year, orders = "y") %within% int, 1, 0)) %>%
    spread(year, val) %>%
    rename_at(vars(`1999`:`2009`), ~ paste0("y", .))
#   ID start.date   end.date                            int y1999 y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009
# 1  1 1999-03-02 2003-06-15 1999-03-02 UTC--2003-06-15 UTC     1     1     1     1     1     0     0     0     0     0     0
# 2  2 1995-11-23 2007-09-26 1995-11-23 UTC--2007-09-26 UTC     1     1     1     1     1     1     1     1     1     0     0
# 3  n 2007-02-19 2010-08-06 2007-02-19 UTC--2010-08-06 UTC     0     0     0     0     0     0     0     0     1     1     1

This sets up a time interval and checks whether each year falls within it. Note also that, for convenience, I only set the code up for 1999 to 2009.
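
To make the interval test easier to follow, here is a minimal sketch for a single patient (ID 1 from the sample data). It assumes, as the code above does, that follow-up is truncated to the start of the first and last calendar years, and it represents each candidate year by its 1 January. The names `int`, `year_starts` and `flags` are illustrative only.

```r
library(lubridate)

# Follow-up for ID 1, truncated to the start of its first and last year
int <- interval(ymd("1999-01-01"), ymd("2003-01-01"))

# Each candidate year is represented by its 1 January
years <- 1999:2009
year_starts <- make_date(years, 1, 1)

# %within% is inclusive at both ends of the interval
flags <- as.integer(year_starts %within% int)
names(flags) <- paste0("y", years)
flags
```

Because both ends of the interval are inclusive, the years 1999 through 2003 are flagged and the later years are not.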

Answer 3

Here is another option using the data.table package:

library(data.table)
dat <- fread("ID   start.date   end.date
0    1990-11-23   2007-09-26
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
3    2007-02-19   2010-08-06
4    2005-02-19   2016-08-06")

#convert columns to Date class
cols <- names(dat)[-1L]
dat[, (cols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=cols]

#get start and end years
dat[, ':=' (startyear=year(start.date), endyear=year(end.date))]

#create a table of sequencing years to be used for joining
period <- data.table(yr=1990:2016, YEAR=1990:2016)

dcast(
    #perform a non-equi join between years sequence and dataset
    period[dat, on=.(yr >= startyear, yr <= endyear)], 
    #pivot results according to OP's request
    ID + start.date + end.date ~ YEAR, 
    length, 
    value.var="YEAR"
)

Output:

   ID start.date   end.date 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
1:  0 1990-11-23 2007-09-26    1    1    1    1    1    1    1    1    1    1    1
2:  1 1999-03-02 2003-06-15    0    0    0    0    0    0    0    0    0    1    1
3:  2 1995-11-23 2007-09-26    0    0    0    0    0    1    1    1    1    1    1
4:  3 2007-02-19 2010-08-06    0    0    0    0    0    0    0    0    0    0    0
5:  4 2005-02-19 2016-08-06    0    0    0    0    0    0    0    0    0    0    0
   2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
1:    1    1    1    1    1    1    1    0    0    0    0    0    0    0    0    0
2:    1    1    1    0    0    0    0    0    0    0    0    0    0    0    0    0
3:    1    1    1    1    1    1    1    0    0    0    0    0    0    0    0    0
4:    0    0    0    0    0    0    1    1    1    1    0    0    0    0    0    0
5:    0    0    0    0    1    1    1    1    1    1    1    1    1    1    1    1
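
If only the number of registered IDs per year is needed (the denominator the question is after), the wide table may not be necessary. As a rough sketch under that assumption, the same non-equi join can be aggregated directly by year. The name `counts` is illustrative, and the explicit `as.IDate()` conversion is a choice made here so the block does not depend on how `fread` parses the date columns.

```r
library(data.table)

dat <- fread(text = "ID   start.date   end.date
0    1990-11-23   2007-09-26
1    1999-03-02   2003-06-15
2    1995-11-23   2007-09-26
3    2007-02-19   2010-08-06
4    2005-02-19   2016-08-06")

# Calendar year of the start and end of follow-up
dat[, `:=`(startyear = year(as.IDate(start.date)),
           endyear   = year(as.IDate(end.date)))]

# Lookup table of years; YEAR keeps the original value after the join
period <- data.table(yr = 1990:2016, YEAR = 1990:2016)

# One row per (patient, year) pair within follow-up, then count rows per year
counts <- period[dat, on = .(yr >= startyear, yr <= endyear)][, .N, keyby = YEAR]
counts
```

Years in which nobody is registered do not appear in `counts`, and a patient whose follow-up lies entirely outside 1990 to 2016 would produce a row with `YEAR` equal to `NA`, so real data may need that case filtered out.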
