将季度数据分解为 R 保持值中的每日数据?
Disaggregate quarterly data to daily data in R keeping values?
如何轻松地将季度数据分解为每日数据?在这种情况下,我使用了 10 年的美国 GDP 数据,这些数据具有季度观测值,我想将数据框扩展到每日级别,将每天的 GDP 值结转到下一次观测值。
Reprex table:
structure(list(thedate = structure(c(14426, 14518, 14610, 14700,
14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522,
15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344,
16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167,
17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9,
-0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5,
5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5,
2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA,
-40L))
我们看到上面:
2009-07-01 | 1.5
2009-10-01 | 4.5
预期输出如下:
2009-07-01 | 1.5
2009-07-02 | 1.5
2009-07-03 | 1.5
etc.
2009-10-01 | 4.5
2009-10-02 | 4.5
2009-10-03 | 4.5
这是一个基本的解决方案:
last_quarter_end_date <- seq.Date(df$thedate[nrow(df)], by = 'quarter', length.out = 2)[-1]-1
seqs <- diff(c(df$thedate, last_quarter_end_date))
data.frame(thedate = rep(df$thedate, seqs) + sequence(seqs)-1
, gdp = rep(df$gdp, seqs))
基本上,日期之间的差异是您需要重复 GDP 列的次数。此外,我可以为每个差异做 seq_len()
以添加回原始日期。
性能
这种方法很有效,但我会注意到 0.6 毫秒与大图上的 15 毫秒并没有太大区别。
Unit: microseconds
expr min lq mean median uq max neval
cole_base 528.1 554.15 690.379 644.9 663.75 3225.7 100
d_b_base 15735.0 15994.40 17395.754 16243.9 18108.30 38761.8 100
Ben_tidyr 2808.7 2936.40 3356.324 3076.6 3149.65 8065.1 100
完整代码供参考:
DF <- structure(list(thedate = structure(c(14426, 14518, 14610, 14700,
14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522,
15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344,
16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167,
17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9,
-0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5,
5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5,
2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA,
-40L))
library(microbenchmark)
library(tidyr)
microbenchmark(cole_base = {
last_quarter_end_date <- seq.Date(DF$thedate[nrow(DF)], by = 'quarter', length.out = 2)[-1]-1
seqs <- diff(c(DF$thedate, last_quarter_end_date))
data.frame(thedate = rep(DF$thedate, seqs) + sequence(seqs)-1
, gdp = rep(DF$gdp, seqs))
}
, d_b_base = {
do.call(rbind, lapply(2:NROW(DF), function(i){
data.frame(date = head(seq.Date(DF$thedate[i-1], DF$thedate[i], "days"), -1),
gdp = DF$gdp[i - 1])
}))
}
, Ben_tidyr = {
DF %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
fill(gdp)
}
)
这是一个 tidyr 和 zoo 包答案,它在插入带有 NA 的日期序列后使用 'last observation carried forward':
library(tidyverse)
library(zoo)
data %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
do(na.locf(.))
编辑:感谢 Shree 提醒 tidyr::fill 将消除对动物园的需求:
library(tidyverse)
data %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
fill(gdp)
如何轻松地将季度数据分解为每日数据?在这种情况下,我使用了 10 年的美国 GDP 数据,这些数据具有季度观测值,我想将数据框扩展到每日级别,将每天的 GDP 值结转到下一次观测值。
Reprex table:
structure(list(thedate = structure(c(14426, 14518, 14610, 14700,
14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522,
15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344,
16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167,
17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9,
-0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5,
5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5,
2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA,
-40L))
我们看到上面:
2009-07-01 | 1.5
2009-10-01 | 4.5
预期输出如下:
2009-07-01 | 1.5
2009-07-02 | 1.5
2009-07-03 | 1.5
etc.
2009-10-01 | 4.5
2009-10-02 | 4.5
2009-10-03 | 4.5
这是一个基本的解决方案:
last_quarter_end_date <- seq.Date(df$thedate[nrow(df)], by = 'quarter', length.out = 2)[-1]-1
seqs <- diff(c(df$thedate, last_quarter_end_date))
data.frame(thedate = rep(df$thedate, seqs) + sequence(seqs)-1
, gdp = rep(df$gdp, seqs))
基本上,日期之间的差异是您需要重复 GDP 列的次数。此外,我可以为每个差异做 seq_len()
以添加回原始日期。
性能 这种方法很有效,但我会注意到 0.6 毫秒与大图上的 15 毫秒并没有太大区别。
Unit: microseconds
expr min lq mean median uq max neval
cole_base 528.1 554.15 690.379 644.9 663.75 3225.7 100
d_b_base 15735.0 15994.40 17395.754 16243.9 18108.30 38761.8 100
Ben_tidyr 2808.7 2936.40 3356.324 3076.6 3149.65 8065.1 100
完整代码供参考:
DF <- structure(list(thedate = structure(c(14426, 14518, 14610, 14700,
14791, 14883, 14975, 15065, 15156, 15248, 15340, 15431, 15522,
15614, 15706, 15796, 15887, 15979, 16071, 16161, 16252, 16344,
16436, 16526, 16617, 16709, 16801, 16892, 16983, 17075, 17167,
17257, 17348, 17440, 17532, 17622, 17713, 17805, 17897, 17987
), class = "Date"), gdp = c(1.5, 4.5, 1.5, 3.7, 3, 2, -1, 2.9,
-0.1, 4.7, 3.2, 1.7, 0.5, 0.5, 3.6, 0.5, 3.2, 3.2, -1.1, 5.5,
5, 2.3, 3.2, 3, 1.3, 0.1, 2, 1.9, 2.2, 2, 2.3, 2.2, 3.2, 3.5,
2.5, 3.5, 2.9, 1.1, 3.1, 2.1)), class = "data.frame", row.names = c(NA,
-40L))
library(microbenchmark)
library(tidyr)
microbenchmark(cole_base = {
last_quarter_end_date <- seq.Date(DF$thedate[nrow(DF)], by = 'quarter', length.out = 2)[-1]-1
seqs <- diff(c(DF$thedate, last_quarter_end_date))
data.frame(thedate = rep(DF$thedate, seqs) + sequence(seqs)-1
, gdp = rep(DF$gdp, seqs))
}
, d_b_base = {
do.call(rbind, lapply(2:NROW(DF), function(i){
data.frame(date = head(seq.Date(DF$thedate[i-1], DF$thedate[i], "days"), -1),
gdp = DF$gdp[i - 1])
}))
}
, Ben_tidyr = {
DF %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
fill(gdp)
}
)
这是一个 tidyr 和 zoo 包答案,它在插入带有 NA 的日期序列后使用 'last observation carried forward':
library(tidyverse)
library(zoo)
data %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
do(na.locf(.))
编辑:感谢 Shree 提醒 tidyr::fill 将消除对动物园的需求:
library(tidyverse)
data %>%
complete(thedate = seq.Date(min(thedate), max(thedate), by="day")) %>%
fill(gdp)