How to find whether multiple years are within patient follow-up in R
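Feel free to delete this question if it is technically a duplicate, but I have looked through many similar answers and none of them work for my data.
I have patient follow-up data like this: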
ID start.date end.date
1 1999-03-02 2003-06-15
2 1995-11-23 2007-09-26
..
..
n 2007-02-19 2010-08-06
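This is a simplified example; the real data has over 4 million IDs.
I am trying to find out how many IDs were registered in each year from 1990 to 2016, so that I can calculate the yearly incidence rate (the disease status column is omitted here). I would like a dataset like this: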
ID start.date end.date y1990 ... y1995 .. y2000 .. y2005 .. y2016
1 1999-03-02 2003-06-15 0 0 1 0 0
2 1990-11-23 2007-09-26 1 1 1 1 0
..
..
n 2005-02-19 2016-08-06 0 0 0 1 1
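Each column has the value 1 if the patient is still "registered" in that year, and 0 otherwise.
As a side note, if anyone knows of an R package that can calculate stratified incidence rates, that would be even better, but so far I have not been able to get any of them to do what I want.
I have tried various solutions with data.table, lubridate, and dplyr, but to no avail. Any help would be appreciated.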
You could do it like this:
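library(tidyverse)
df %>%
# extract the start year by stripping "-MM-DD" from start.date
mutate(year = as.numeric(sub("-\\d+-\\d+$", "", start.date))) %>%
group_by(ID) %>%
mutate(n = 1:n()) %>%
# one column per start year; 1 marks the year in which follow-up began
spread(year, n, fill = 0)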
## A tibble: 3 x 6
## Groups: ID [3]
# ID start.date end.date `1995` `1999` `2007`
# <fct> <fct> <fct> <dbl> <dbl> <dbl>
#1 1 1999-03-02 2003-06-15 0. 1. 0.
#2 2 1995-11-23 2007-09-26 1. 0. 0.
#3 n 2007-02-19 2010-08-06 0. 0. 1.
Sample data
df <- read.table(text =
"ID start.date end.date
1 1999-03-02 2003-06-15
2 1995-11-23 2007-09-26
n 2007-02-19 2010-08-06", header = TRUE)
Another option you could try:
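library(tidyverse)
library(lubridate)
# build every ID x year combination (tibble() replaces the deprecated data_frame())
tibble(year = rep(1999:2009, each = nrow(df)), ID = rep(df$ID, 2009-1998)) %>%
left_join(df, ., by = "ID") %>%
# interval running from the start year to the end year of follow-up
mutate(int = interval(parse_date_time(substring(start.date,1,4), orders = "y"), parse_date_time(substring(end.date,1,4), orders = "y"))) %>%
# 1 if the year falls within the interval, 0 otherwise
mutate(val = ifelse(parse_date_time(year, orders = "y") %within% int, 1, 0)) %>%
spread(year, val) %>%
# prefix the year columns with "y" (a formula replaces the deprecated funs())
rename_at(vars(`1999`:`2009`), ~ paste0("y", .))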
# ID start.date end.date int y1999 y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009
# 1 1 1999-03-02 2003-06-15 1999-03-02 UTC--2003-06-15 UTC 1 1 1 1 1 0 0 0 0 0 0
# 2 2 1995-11-23 2007-09-26 1995-11-23 UTC--2007-09-26 UTC 1 1 1 1 1 1 1 1 1 0 0
# 3 n 2007-02-19 2010-08-06 2007-02-19 UTC--2010-08-06 UTC 0 0 0 0 0 0 0 0 1 1 1
This sets up an interval and evaluates whether each year falls within it. Note also that, for convenience, I set the code up only for 1999 through 2009.
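The same year-level comparison can also be sketched in base R without an interval object. This is only an illustration under assumptions: the helper name `year_flags` is hypothetical, and a year is treated as covered when it lies between the start year and the end year, inclusive.

```r
# Hypothetical helper: adds one 0/1 column per calendar year.
year_flags <- function(df, years = 1990:2016) {
  start_year <- as.integer(format(as.Date(df$start.date), "%Y"))
  end_year   <- as.integer(format(as.Date(df$end.date), "%Y"))
  # outer() compares every patient (rows) against every year (columns)
  flags <- outer(start_year, years, "<=") & outer(end_year, years, ">=")
  flags <- matrix(as.integer(flags), nrow = nrow(df),
                  dimnames = list(NULL, paste0("y", years)))
  cbind(df, flags)
}

df <- data.frame(
  ID = c("1", "2", "n"),
  start.date = c("1999-03-02", "1995-11-23", "2007-02-19"),
  end.date = c("2003-06-15", "2007-09-26", "2010-08-06"),
  stringsAsFactors = FALSE
)
res <- year_flags(df, years = 1999:2009)
res
```

With several million rows the full flag matrix can get large, so it may be worth restricting `years` or working in chunks.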
Here is another option using the data.table package:
library(data.table)
dat <- fread("ID start.date end.date
0 1990-11-23 2007-09-26
1 1999-03-02 2003-06-15
2 1995-11-23 2007-09-26
3 2007-02-19 2010-08-06
4 2005-02-19 2016-08-06")
#convert columns to Date class
cols <- names(dat)[-1L]
dat[, (cols) := lapply(.SD, as.Date, format="%Y-%m-%d"), .SDcols=cols]
#get start and end years
dat[, ':=' (startyear=year(start.date), endyear=year(end.date))]
#create a table of sequencing years to be used for joining
period <- data.table(yr=1990:2016, YEAR=1990:2016)
dcast(
#perform a non-equi join between years sequence and dataset
period[dat, on=.(yr >= startyear, yr <= endyear)],
#pivot results according to OP's request
ID + start.date + end.date ~ YEAR,
length,
value.var="YEAR"
)
Output:
ID start.date end.date 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
1: 0 1990-11-23 2007-09-26 1 1 1 1 1 1 1 1 1 1 1
2: 1 1999-03-02 2003-06-15 0 0 0 0 0 0 0 0 0 1 1
3: 2 1995-11-23 2007-09-26 0 0 0 0 0 1 1 1 1 1 1
4: 3 2007-02-19 2010-08-06 0 0 0 0 0 0 0 0 0 0 0
5: 4 2005-02-19 2016-08-06 0 0 0 0 0 0 0 0 0 0 0
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
1: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
2: 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
4: 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
5: 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
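Since the stated goal is the number of registered IDs per year, the wide table is not strictly required for that step. Below is a minimal base R sketch, assuming the same inclusive start-year/end-year rule as the join above; `count_registered` is a hypothetical name, not a data.table function.

```r
# Hypothetical helper: number of patients under follow-up in each year.
count_registered <- function(start_year, end_year, years = 1990:2016) {
  counts <- vapply(
    years,
    # a patient counts for year y if start_year <= y <= end_year
    function(y) sum(start_year <= y & end_year >= y),
    integer(1)
  )
  names(counts) <- years
  counts
}

# Start and end years of the five example patients above
start_year <- c(1990L, 1999L, 1995L, 2007L, 2005L)
end_year   <- c(2007L, 2003L, 2007L, 2010L, 2016L)
counts <- count_registered(start_year, end_year)
counts[c("1990", "2000", "2007", "2016")]
# 1990 2000 2007 2016
#    1    3    4    1
```

These counts match the column sums of the wide output, and the loop runs once per year rather than once per patient.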