Create a balanced data set
I am using R and have a long data set that looks like this:
Date ID Status
2014-10-01 12 1
2015-04-01 12 1
2015-07-01 12 1
2015-09-01 12 1
2015-11-01 12 0
2016-01-01 12 0
2016-05-01 12 0
2016-08-01 12 1
2017-03-01 12 1
2017-05-01 12 1
2014-10-01 13 1
2015-04-01 13 1
2015-07-01 13 0
2015-11-01 14 0
2016-01-01 14 0
...
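My goal is to create a "balanced" data set, i.e. every ID should appear on each of the 10 dates. For observations that did not occur in the original data, the variable "Status" should be marked N/A. In other words, the result should look like this: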
Date ID Status
2014-10-01 12 1
2015-04-01 12 1
2015-07-01 12 1
2015-09-01 12 1
2015-11-01 12 0
2016-01-01 12 0
2016-05-01 12 0
2016-08-01 12 1
2017-03-01 12 1
2017-05-01 12 1
2014-10-01 13 1
2015-04-01 13 1
2015-07-01 13 0
2015-09-01 13 N/A
2015-11-01 13 N/A
2016-01-01 13 N/A
2016-05-01 13 N/A
2016-08-01 13 N/A
2017-03-01 13 N/A
2017-05-01 13 N/A
2014-10-01 14 N/A
2015-04-01 14 N/A
2015-07-01 14 N/A
2015-09-01 14 N/A
2015-11-01 14 0
2016-01-01 14 0
2016-05-01 14 N/A
2016-08-01 14 N/A
2017-03-01 14 N/A
2017-05-01 14 N/A
...
Thanks for your help!
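Here is an approach using the tidyverse:
library(tidyverse)
df %>%
  group_by(ID) %>%
  expand(Date) %>% # within each ID, expand to all dates (Date is a factor, so expand() uses all of its levels)
  left_join(df, by = c("ID", "Date")) -> df1 # join the original data frame back and save to object df1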
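Or save back to the original object (thanks to Renu for the comment). Note that %<>% comes from magrittr, which library(tidyverse) does not attach:
library(magrittr)
df %<>%
  group_by(ID) %>%
  expand(Date) %>% # within each ID, expand to all dates
  left_join(df, by = c("ID", "Date"))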
which is equivalent to:
df %>%
  group_by(ID) %>%
  expand(Date) %>% # within each ID, expand to all dates
  left_join(df, by = c("ID", "Date")) -> df
Result:
ID Date Status
1 12 2014-10-01 1
2 12 2015-04-01 1
3 12 2015-07-01 1
4 12 2015-09-01 1
5 12 2015-11-01 0
6 12 2016-01-01 0
7 12 2016-05-01 0
8 12 2016-08-01 1
9 12 2017-03-01 1
10 12 2017-05-01 1
11 13 2014-10-01 1
12 13 2015-04-01 1
13 13 2015-07-01 0
14 13 2015-09-01 NA
15 13 2015-11-01 NA
16 13 2016-01-01 NA
17 13 2016-05-01 NA
18 13 2016-08-01 NA
19 13 2017-03-01 NA
20 13 2017-05-01 NA
21 14 2014-10-01 NA
22 14 2015-04-01 NA
23 14 2015-07-01 NA
24 14 2015-09-01 NA
25 14 2015-11-01 0
26 14 2016-01-01 0
27 14 2016-05-01 NA
28 14 2016-08-01 NA
29 14 2017-03-01 NA
30 14 2017-05-01 NA
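The grouped expand() above fills in every date only because Date is stored as a factor. As an alternative sketch (not the approach from this answer), tidyr::complete() builds all ID/Date combinations directly and should also work when Date is a character or Date column. The small data frame here is made-up stand-in data, not the question's full data:

```r
library(tidyr)

# Hypothetical stand-in for the question's data; Date is kept as character
# on purpose to show that no factor levels are needed.
df <- data.frame(
  Date   = c("2014-10-01", "2015-04-01", "2014-10-01", "2015-07-01"),
  ID     = c(12L, 12L, 13L, 14L),
  Status = c(1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# complete() creates every ID x Date combination found in the data and
# fills columns that were not observed (here Status) with NA.
balanced <- complete(df, ID, Date)
```

With 3 IDs and 3 distinct dates in the stand-in data, balanced has one row per ID/Date pair, and Status is NA wherever the pair was not observed.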
Data:
> dput(df)
structure(list(Date = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 1L, 2L, 3L, 5L, 6L), .Label = c("2014-10-01", "2015-04-01",
"2015-07-01", "2015-09-01", "2015-11-01", "2016-01-01", "2016-05-01",
"2016-08-01", "2017-03-01", "2017-05-01"), class = "factor"),
ID = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L,
13L, 13L, 13L, 14L, 14L), Status = c(1L, 1L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)), .Names = c("Date",
"ID", "Status"), class = "data.frame", row.names = c(NA, -15L
))
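The following worked for me:
library(dplyr)
# one row per ID/Date pair: all unique dates repeated for every unique ID
df_b <- data.frame(Date = rep(unique(df$Date), length(unique(df$ID))),
                   ID = rep(unique(df$ID), each = length(unique(df$Date))))
balanced_data <- left_join(df_b, df, by = c("Date", "ID"))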
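If you would rather avoid extra packages, the same idea can be sketched in base R with expand.grid() and merge(). This is only an illustration on made-up stand-in data, and the names grid and balanced are mine:

```r
# Hypothetical stand-in for the question's data.
df <- data.frame(
  Date   = c("2014-10-01", "2015-04-01", "2014-10-01", "2015-07-01"),
  ID     = c(12L, 12L, 13L, 14L),
  Status = c(1L, 0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Every combination of the unique dates and the unique IDs.
grid <- expand.grid(Date = unique(df$Date), ID = unique(df$ID),
                    stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of the grid; pairs that are missing from df
# get NA in Status.
balanced <- merge(grid, df, by = c("Date", "ID"), all.x = TRUE)

# Sort by ID, then Date, to match the layout used in the question.
balanced <- balanced[order(balanced$ID, balanced$Date), ]
```

merge() with all.x = TRUE plays the role of left_join() here, so the result should have one row per ID/Date pair.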