Generate a new factor variable based on the presence or absence of observations in different date variables

Question

I have the following data

R code

df <- data.frame(idnum = c(1001, 1002, 1003, 1004),
                 date1 = c("2003-03-19", "2003-03-19", "2003-03-19", "2003-03-19"),
                 date2 = c("2004-03-24", NA, "2004-03-25", "2004-03-26"),
                 date3 = c("2005-05-11", "2005-05-12", "2005-05-12", NA))

and I want to do something I have done before in Stata

Stata code

gen xvisit=1 if date1 !=. & date2 !=. & date3!=.
replace xvisit=2 if date1 !=. & date2 !=. & date3 ==. 
replace xvisit=3 if date1 !=. & date2 ==. & date3 !=.
replace xvisit=4 if date1 !=. & date2 ==. & date3 ==.
label define xvisit 1 "All" 2 "Baseline & 2nd" 3 "Baseline & 3rd" 4 "Baseline only"
label values xvisit xvisit

But I just can't get it right in R. My goal is to get something like this Stata output:

[image: Stata output]

Answer 1

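You can do this using within. In R, we start by initializing the xvisit column with NA. To find, row by row, where none of the dates is missing, we can use rowSums of the negated is.na check: a row has all three dates when the sum is 3. The other lines should be self-explanatory. Finally we create a factor, where levels= corresponds to the numeric values and labels= follows the order of the levels.

You may also want to consider converting the character dates to Date format beforehand using as.Date.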

df[c("date1", "date2", "date3")] <- lapply(df[c("date1", "date2", "date3")], as.Date)

df <- within(df, {
  xvisit <- NA
  xvisit[rowSums(!is.na(df[c("date1", "date2", "date3")])) == 3] <- 1
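  # mirror the Stata conditions: every category requires a non-missing date1
  xvisit[!is.na(date1) & !is.na(date2) & is.na(date3)] <- 2
  xvisit[!is.na(date1) & is.na(date2) & !is.na(date3)] <- 3
  xvisit[!is.na(date1) & is.na(date2) & is.na(date3)] <- 4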
  xvisit <- factor(xvisit, levels=1:4, 
                   labels=c("All", "Baseline & 2nd", "Baseline & 3rd", "Baseline only"))
})
df
#   idnum      date1      date2      date3         xvisit
# 1  1001 2003-03-19 2004-03-24 2005-05-11            All
# 2  1002 2003-03-19       <NA> 2005-05-12 Baseline & 3rd
# 3  1003 2003-03-19 2004-03-25 2005-05-12            All
# 4  1004 2003-03-19 2004-03-26       <NA> Baseline & 2nd
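
As a possible alternative, here is a minimal sketch that encodes the missingness pattern of date2 and date3 arithmetically instead of assigning each category in turn. It is not part of the original answer. The names demo, miss2, miss3 and code are made up for illustration, and row 1005 is an invented case (baseline date only) added so that every label appears at least once.

```r
# Hypothetical demo data: rows 1001-1004 mirror the data in this thread;
# row 1005 is invented to show the "Baseline only" category.
demo <- data.frame(
  idnum = c(1001, 1002, 1003, 1004, 1005),
  date1 = as.Date(c("2003-03-19", "2003-03-19", "2003-03-19",
                    "2003-03-19", "2003-03-19")),
  date2 = as.Date(c("2004-03-24", NA, "2004-03-25", "2004-03-26", NA)),
  date3 = as.Date(c("2005-05-11", "2005-05-12", "2005-05-12", NA, NA))
)

miss2 <- is.na(demo$date2)  # TRUE where the 2nd visit is missing
miss3 <- is.na(demo$date3)  # TRUE where the 3rd visit is missing

# 1 = nothing missing, 2 = only date3 missing,
# 3 = only date2 missing, 4 = both missing.
# Rows without a baseline date stay NA, as in the Stata code.
code <- ifelse(is.na(demo$date1), NA, 1 + miss3 + 2 * miss2)

demo$xvisit <- factor(code, levels = 1:4,
                      labels = c("All", "Baseline & 2nd",
                                 "Baseline & 3rd", "Baseline only"))

as.character(demo$xvisit)
# "All" "Baseline & 3rd" "All" "Baseline & 2nd" "Baseline only"
```

The arithmetic only works because there are exactly two optional dates and the labels happen to follow that order. With more visit dates, spelling out each condition as in the within() version is likely easier to read.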

Data:

df <- structure(list(idnum = c(1001, 1002, 1003, 1004), date1 = c("2003-03-19", 
"2003-03-19", "2003-03-19", "2003-03-19"), date2 = c("2004-03-24", 
NA, "2004-03-25", "2004-03-26"), date3 = c("2005-05-11", "2005-05-12", 
"2005-05-12", NA)), class = "data.frame", row.names = c(NA, -4L
))

r

dataframe

stata

dplyr