Create a balanced data set

I am using R and have a long data set that looks like this:

Date           ID     Status
2014-10-01     12      1
2015-04-01     12      1
2015-07-01     12      1
2015-09-01     12      1
2015-11-01     12      0
2016-01-01     12      0
2016-05-01     12      0
2016-08-01     12      1
2017-03-01     12      1
2017-05-01     12      1
2014-10-01     13      1
2015-04-01     13      1
2015-07-01     13      0
2015-11-01     14      0
2016-01-01     14      0
...

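My goal is to create a "balanced" data set, meaning every ID should appear at each of the 10 dates. For observations that did not originally occur, the variable "Status" should be marked N/A. In other words, the result should look like this: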

Date           ID     Status
2014-10-01     12      1
2015-04-01     12      1
2015-07-01     12      1
2015-09-01     12      1
2015-11-01     12      0
2016-01-01     12      0
2016-05-01     12      0
2016-08-01     12      1
2017-03-01     12      1
2017-05-01     12      1
2014-10-01     13      1
2015-04-01     13      1
2015-07-01     13      0
2015-09-01     13      N/A
2015-11-01     13      N/A
2016-01-01     13      N/A
2016-05-01     13      N/A
2016-08-01     13      N/A
2017-03-01     13      N/A
2017-05-01     13      N/A
2014-10-01     14      N/A
2015-04-01     14      N/A
2015-07-01     14      N/A
2015-09-01     14      N/A
2015-11-01     14      0
2016-01-01     14      0
2016-05-01     14      N/A
2016-08-01     14      N/A
2017-03-01     14      N/A
2017-05-01     14      N/A
...

Thanks for your help!

Here is an approach using tidyverse:

library(tidyverse)
df %>%
 group_by(ID) %>%
 expand(Date) %>% # within each ID, expand to all Date levels (Date is a factor)
 left_join(df, by = c("ID", "Date")) -> df1 # join the original data back and save as df1

Or save the result back to the original object (thanks to Renu for the comment):

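library(magrittr) # %<>% is not attached by library(tidyverse)
df %<>%
 group_by(ID) %>%
 expand(Date) %>% # within each ID, expand to all Date levels
 left_join(df, by = c("ID", "Date"))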

which is equivalent to:

df %>%
 group_by(ID) %>%
 expand(Date) %>% # within each ID, expand to all Date levels
 left_join(df, by = c("ID", "Date")) -> df

Result:

   ID       Date Status
1  12 2014-10-01      1
2  12 2015-04-01      1
3  12 2015-07-01      1
4  12 2015-09-01      1
5  12 2015-11-01      0
6  12 2016-01-01      0
7  12 2016-05-01      0
8  12 2016-08-01      1
9  12 2017-03-01      1
10 12 2017-05-01      1
11 13 2014-10-01      1
12 13 2015-04-01      1
13 13 2015-07-01      0
14 13 2015-09-01     NA
15 13 2015-11-01     NA
16 13 2016-01-01     NA
17 13 2016-05-01     NA
18 13 2016-08-01     NA
19 13 2017-03-01     NA
20 13 2017-05-01     NA
21 14 2014-10-01     NA
22 14 2015-04-01     NA
23 14 2015-07-01     NA
24 14 2015-09-01     NA
25 14 2015-11-01      0
26 14 2016-01-01      0
27 14 2016-05-01     NA
28 14 2016-08-01     NA
29 14 2017-03-01     NA
30 14 2017-05-01     NA
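
A quick sanity check, sketched here in base R, is to confirm that every ID has exactly one row per date. The helper name `is_balanced` is an assumption made for illustration and is not part of any package; the toy data frames only reuse a few rows in the same shape as the data in this answer.

```r
# Hypothetical helper (not from any package): TRUE when every ID has
# exactly one row for every distinct Date in the data frame.
is_balanced <- function(d) {
  n_dates <- length(unique(d$Date))
  counts <- table(d$ID, d$Date)  # number of rows per ID/Date combination
  nrow(d) == length(unique(d$ID)) * n_dates && all(counts == 1)
}

# Toy illustration: ID 13 is missing the second date
unbalanced <- data.frame(
  Date   = c("2014-10-01", "2015-04-01", "2014-10-01"),
  ID     = c(12L, 12L, 13L),
  Status = c(1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Same data with the missing combination added as NA
balanced <- rbind(
  unbalanced,
  data.frame(Date = "2015-04-01", ID = 13L, Status = NA_integer_,
             stringsAsFactors = FALSE)
)

is_balanced(unbalanced)  # FALSE
is_balanced(balanced)    # TRUE
```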

Data:

> dput(df)
structure(list(Date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 1L, 2L, 3L, 5L, 6L), .Label = c("2014-10-01", "2015-04-01", 
"2015-07-01", "2015-09-01", "2015-11-01", "2016-01-01", "2016-05-01", 
"2016-08-01", "2017-03-01", "2017-05-01"), class = "factor"), 
    ID = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 
    13L, 13L, 13L, 14L, 14L), Status = c(1L, 1L, 1L, 1L, 0L, 
    0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("Date", 
"ID", "Status"), class = "data.frame", row.names = c(NA, -15L
))
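
Note that the grouped `expand(Date)` in this answer fills in all ten dates only because `Date` is a factor, and tidyr expands a factor to its full set of levels. If `Date` were stored as character or as class `Date`, each group would get back only the dates it already contains. A sketch that does not depend on this, assuming tidyr is installed, is to call `complete()` on the ungrouped data. The object `df_small` and its values are a cut-down illustration, not the full data:

```r
library(tidyr)

# Cut-down illustration: Date is plain character here, not a factor
df_small <- data.frame(
  Date   = c("2014-10-01", "2015-04-01", "2015-07-01", "2014-10-01", "2015-07-01"),
  ID     = c(12L, 12L, 12L, 13L, 14L),
  Status = c(1L, 1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# complete() adds every missing ID/Date combination and fills Status with NA
balanced <- complete(df_small, ID, Date)

nrow(balanced)               # 3 IDs x 3 dates = 9 rows
sum(is.na(balanced$Status))  # 4 combinations were missing
```

Because the expansion happens across the whole data frame rather than within each group, the set of dates comes from all IDs together.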

The following worked for me:

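library(dplyr)

# Build every Date x ID combination (column names matched to the data above)
df_b <- data.frame(Date = rep(unique(df$Date), times = length(unique(df$ID))),
               ID = rep(unique(df$ID), each = length(unique(df$Date))))

# Attach Status; combinations that are missing from df get NA
balanced_data <- left_join(df_b, df, by = c("Date", "ID"))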
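
The same grid can also be built without any packages. As a rough sketch, `expand.grid()` generates every `Date` by `ID` pair and `merge(..., all.x = TRUE)` plays the role of the left join. The small `df_small` below is an illustrative stand-in for the real data, not the data from the question:

```r
# Illustrative stand-in: ID 13 has no row for 2015-04-01
df_small <- data.frame(
  Date   = c("2014-10-01", "2015-04-01", "2014-10-01"),
  ID     = c(12L, 12L, 13L),
  Status = c(1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Every combination of the observed dates and IDs
grid <- expand.grid(Date = unique(df_small$Date),
                    ID = unique(df_small$ID),
                    stringsAsFactors = FALSE)

# Left join in base R: keep all grid rows, unmatched rows get Status = NA
balanced <- merge(grid, df_small, by = c("Date", "ID"), all.x = TRUE)

# Order by ID, then Date, to match the layout asked for in the question
balanced <- balanced[order(balanced$ID, balanced$Date), ]
```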