Transform into balanced panel data
I have an unbalanced panel, as in the example below:
test <- read.table(
text = "
A 2010-01-01 1 rdm
A 2010-01-10 2 dfg
A 2010-01-14 3 fdgfd
A 2010-02-15 4 fdgfd
A 2010-08-17 5 dg
A 2010-12-19 6 dfg
B 2009-01-01 1 dfg
B 2010-01-01 2 ydg
B 2010-01-10 3 fdgfd
B 2010-01-14 4 dfg
B 2010-02-15 5 dfg
", header = FALSE)
library(data.table)
setDT(test)
setnames(test, c("ID", "date", "nr", "namecol"))
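I would like to balance it with respect to the dates, so that every individual (A, B, etc.) has an NA row for each date on which it has no data. I do not know the minimum date per group or across groups. The same holds for the maximum, although it might be faster to simply set the maximum to a specific date rather than compute it across groups.
The desired output is: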
out <- read.table(
text = "
A 2009-01-01 NA NA
A 2010-01-01 1 rdm
A 2010-01-10 2 dfg
A 2010-01-14 3 fdgfd
A 2010-02-15 4 fdgfd
A 2010-08-17 5 dg
A 2010-12-19 6 dfg
B 2009-01-01 1 dfg
B 2010-01-01 2 ydg
B 2010-01-10 3 fdgfd
B 2010-01-14 4 dfg
B 2010-02-15 5 dfg
B 2010-08-17 NA NA
B 2010-12-19 NA NA
", header = FALSE)
setDT(out)
setnames(out, c("ID", "date", "nr", "namecol"))
My dataset is very large, so I think it would be best to do this in data.table (or plyr, reshape2) or something similarly suitable.
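After setting the key columns to 'ID' and 'date', we can use CJ to build every combination of ID and date and then join it with the original dataset. Combinations that have no match in the original data come back as NA rows.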
setDT(test, key = c("ID", "date"))[CJ(ID, date, unique=TRUE)]
# ID date nr namecol
# 1: A 2009-01-01 NA NA
# 2: A 2010-01-01 1 rdm
# 3: A 2010-01-10 2 dfg
# 4: A 2010-01-14 3 fdgfd
# 5: A 2010-02-15 4 fdgfd
# 6: A 2010-08-17 5 dg
# 7: A 2010-12-19 6 dfg
# 8: B 2009-01-01 1 dfg
# 9: B 2010-01-01 2 ydg
#10: B 2010-01-10 3 fdgfd
#11: B 2010-01-14 4 dfg
#12: B 2010-02-15 5 dfg
#13: B 2010-08-17 NA NA
#14: B 2010-12-19 NA NA
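If you would rather not change the key of test, the same join can probably be written as an ad hoc join with the on argument. The following is only a sketch under that assumption: the object names grid and balanced are made up for illustration, and the toy data is rebuilt inline so the snippet runs on its own.

library(data.table)

# Same toy data as in the question, built directly as a data.table
test <- data.table(
  ID      = c(rep("A", 6), rep("B", 5)),
  date    = as.Date(c("2010-01-01", "2010-01-10", "2010-01-14", "2010-02-15",
                      "2010-08-17", "2010-12-19", "2009-01-01", "2010-01-01",
                      "2010-01-10", "2010-01-14", "2010-02-15")),
  nr      = c(1:6, 1:5),
  namecol = c("rdm", "dfg", "fdgfd", "fdgfd", "dg", "dfg",
              "dfg", "ydg", "fdgfd", "dfg", "dfg")
)

# Every ID paired with every observed date (hypothetical helper table)
grid <- CJ(ID = test$ID, date = test$date, unique = TRUE)

# Ad hoc join: grid rows without a match in test get NA in nr and namecol
balanced <- test[grid, on = .(ID, date)]

# Sanity checks: one row per ID/date pair, and the same row count for each ID
stopifnot(nrow(balanced) == uniqueN(test$ID) * uniqueN(test$date))
stopifnot(balanced[, .N, by = ID][, uniqueN(N)] == 1L)

Because CJ sorts its result by default, balanced comes back ordered by ID and date, and test itself is left without a key.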
Data
test <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B"), date = structure(c(14610, 14619, 14623, 14655,
14838, 14962, 14245, 14610, 14619, 14623, 14655), class = "Date"),
nr = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L), namecol = c("rdm",
"dfg", "fdgfd", "fdgfd", "dg", "dfg", "dfg", "ydg", "fdgfd",
"dfg", "dfg")), .Names = c("ID", "date", "nr", "namecol"),
row.names = c(NA, -11L), class = "data.frame")
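For comparison, the same balancing idea can be sketched in base R with expand.grid and merge. This is not part of the data.table approach above and is likely to be slower on a very large dataset; the names test_df, grid_df and balanced_df are made up for illustration, and the toy data is rebuilt inline so the snippet runs on its own.

# Toy data from the question as a plain data.frame
test_df <- data.frame(
  ID      = c(rep("A", 6), rep("B", 5)),
  date    = as.Date(c("2010-01-01", "2010-01-10", "2010-01-14", "2010-02-15",
                      "2010-08-17", "2010-12-19", "2009-01-01", "2010-01-01",
                      "2010-01-10", "2010-01-14", "2010-02-15")),
  nr      = c(1:6, 1:5),
  namecol = c("rdm", "dfg", "fdgfd", "fdgfd", "dg", "dfg",
              "dfg", "ydg", "fdgfd", "dfg", "dfg"),
  stringsAsFactors = FALSE
)

# All combinations of the observed IDs and the observed dates
grid_df <- expand.grid(ID = unique(test_df$ID),
                       date = unique(test_df$date),
                       stringsAsFactors = FALSE)

# Left join: all.x = TRUE keeps every grid row and fills the gaps with NA
balanced_df <- merge(grid_df, test_df, by = c("ID", "date"), all.x = TRUE)

# Order by ID and date to match the layout of the desired output
balanced_df <- balanced_df[order(balanced_df$ID, balanced_df$date), ]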