移动块的覆盖概率问题bootstrap
Coverage probability problem for moving block bootstrap
我将移动块 bootstrap (MBB) 应用于使用时间序列数据的回归模型。当我计算来自 MBB 的估计量的覆盖概率时,结果是异常的,除了一个系数(x1 的系数被设置为连续变量)。鉴于 MBB 是一种行之有效的方法(参见 https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)),我想知道我的代码是否有问题。感谢任何意见!
set.seed(63)
#create a function to generate time series data
tsfunc3 <- function (size=30, ar=0.7) {
ar.epsilon <- arima.sim(list(order = c(1,0,0), ar = 0.7), n = size, sd=2)
x1=rnorm(size)
x2=sample(1:5, size, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
x3=rbinom(size, 1, 0.5)
y=as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon) #A combination of continuous
#predictor x1, ordinal predictor
#x2 and binary predictor x3
data.frame(time=1:size, x1=x1, x2=x2, x3=x3, y=y)}
#A time series
tdat <- tsfunc3()
# Block length derived from the data based on the approach proposed by Politis & White
#(2003):
b <- 3
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#MBB for intercept estimator
IntMbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "(Intercept)", level = 0.95)
}
#MBB for x1 coefficient estimator
B1Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x1", level = 0.95)
}
#MBB for x2 coefficient estimator
B2Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x2", level = 0.95)
}
#MBB for x3 coefficient estimator
B3Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x3", level = 0.95)
}
#Replications
set.seed(47)
R <- 100
int.mbb <- replicate(R, IntMbb(), simplify=FALSE)
b1.mbb <- replicate(R, B1Mbb(), simplify=FALSE)
b2.mbb <- replicate(R, B2Mbb(), simplify=FALSE)
b3.mbb <- replicate(R, B3Mbb(), simplify=FALSE)
#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb, function(x, y) x[grep(y, rownames(x)), ], "Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
[1] 0.34
#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb, function(x, y) x[grep(y, rownames(x)), ], "x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
[1] 0.9
#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb, function(x, y) x[grep(y, rownames(x)), ], "x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
[1] 0.38
#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb, function(x, y) x[grep(y, rownames(x)), ], "x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
[1] 0.33
如您所见,只有 x1 系数估计器的覆盖概率是可以的。那么我的代码有什么问题吗?还是跟MBB本身有关系?
您并没有真正评估 bootstrap 的覆盖概率。您需要根据 bootstrapped 统计数据建立置信区间,而不是根据 bootstrapped 样本的参数模型 运行 建立置信区间。这是我的做法。
首先,我们可以生成数据:
set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#A time series
tdat <- tsfunc3(size=n, ar=.7)
接下来,我们可以编写一个我们将 bootstrap 的函数。此函数生成 bootstrap 样本,运行 回归并保存系数。
bsfun <- function(data, blocks){
samp.data <- data[sample(1:length(blocks), length(blocks), replace=TRUE), ]
mod <- lm(y ~ x1 + x2 + x3, data=samp.data)
coef(mod)
}
接下来,我们可以 运行 函数很多次。请注意,要生成可靠的 95% 百分位数置信区间,您应该拥有 1500-2500 bootstrap 附近的统计数据。您尝试表征的分位数越远,您需要的 bootstrap 个样本就越多。因此,下面的代码生成一组 bootstrap 系数:
out <- t(replicate(1000, bsfun(data=tdat, blocks=blocks)))
根据这组 bootstrap 统计数据,我们可以得出一个置信区间。
ci1 <- t(apply(out, 2, quantile, probs=c(.025,.975), na.rm=TRUE))
# 2.5% 97.5%
# (Intercept) -0.3302237 10.258229
# x1 -1.7577214 2.301975
# x2 -0.8016478 2.049435
# x3 -3.0723869 6.190383
如果你想调查这些区间的覆盖概率,你将不得不做我上面做的很多次(我们会做 100 次,尽管为了得到更好的估计,你可能想做更多).然后我们可以编写一个小函数来评估一组估计的覆盖率:
eval_cover <- function(true = c(5,.25,.4, .8), obs){
out <- as.numeric(obs[,1] < true & obs[,2] > true)
names(out) <- rownames(obs)
out
}
然后,您可以将该函数应用于您生成的每个 bootstrap 置信区间。使用 rowMeans()
函数将获得覆盖 1/0 值的平均值,这将是覆盖概率。在这种情况下,仅使用 100 个间隔,覆盖率为 100%。
rowMeans(sapply(outci, function(x)eval_cover(obs=x)))
# (Intercept) x1 x2 x3
# 1 1 1 1
我将移动块 bootstrap (MBB) 应用于使用时间序列数据的回归模型。当我计算来自 MBB 的估计量的覆盖概率时,结果是异常的,除了一个系数(x1 的系数被设置为连续变量)。鉴于 MBB 是一种行之有效的方法(参见 https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)),我想知道我的代码是否有问题。感谢任何意见!
set.seed(63)
#create a function to generate time series data
tsfunc3 <- function (size=30, ar=0.7) {
ar.epsilon <- arima.sim(list(order = c(1,0,0), ar = 0.7), n = size, sd=2)
x1=rnorm(size)
x2=sample(1:5, size, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
x3=rbinom(size, 1, 0.5)
y=as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon) #A combination of continuous
#predictor x1, ordinal predictor
#x2 and binary predictor x3
data.frame(time=1:size, x1=x1, x2=x2, x3=x3, y=y)}
#A time series
tdat <- tsfunc3()
# Block length derived from the data based on the approach proposed by Politis & White
#(2003):
b <- 3
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#MBB for intercept estimator
IntMbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "(Intercept)", level = 0.95)
}
#MBB for x1 coefficient estimator
B1Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x1", level = 0.95)
}
#MBB for x2 coefficient estimator
B2Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x2", level = 0.95)
}
#MBB for x3 coefficient estimator
B3Mbb <- function() {
take.blocks <- sample(1:28,10,replace=TRUE)
newdat <- tdat[unlist(blocks[take.blocks]),]
x1 <- unlist(newdat["x1"])
x2 <- unlist(newdat["x2"])
x3 <- unlist(newdat["x3"])
y <- unlist(newdat["y"])
regmbb <- lm(y ~ x1 + x2 + x3)
confint(regmbb, "x3", level = 0.95)
}
#Replications
set.seed(47)
R <- 100
int.mbb <- replicate(R, IntMbb(), simplify=FALSE)
b1.mbb <- replicate(R, B1Mbb(), simplify=FALSE)
b2.mbb <- replicate(R, B2Mbb(), simplify=FALSE)
b3.mbb <- replicate(R, B3Mbb(), simplify=FALSE)
#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb, function(x, y) x[grep(y, rownames(x)), ], "Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
[1] 0.34
#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb, function(x, y) x[grep(y, rownames(x)), ], "x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
[1] 0.9
#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb, function(x, y) x[grep(y, rownames(x)), ], "x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
[1] 0.38
#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb, function(x, y) x[grep(y, rownames(x)), ], "x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
[1] 0.33
如您所见,只有 x1 系数估计器的覆盖概率是可以的。那么我的代码有什么问题吗?还是跟MBB本身有关系?
您并没有真正评估 bootstrap 的覆盖概率。您需要根据 bootstrapped 统计数据建立置信区间,而不是根据 bootstrapped 样本的参数模型 运行 建立置信区间。这是我的做法。
首先,我们可以生成数据:
set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))
#A time series
tdat <- tsfunc3(size=n, ar=.7)
接下来,我们可以编写一个我们将 bootstrap 的函数。此函数生成 bootstrap 样本,运行 回归并保存系数。
bsfun <- function(data, blocks){
samp.data <- data[sample(1:length(blocks), length(blocks), replace=TRUE), ]
mod <- lm(y ~ x1 + x2 + x3, data=samp.data)
coef(mod)
}
接下来,我们可以 运行 函数很多次。请注意,要生成可靠的 95% 百分位数置信区间,您应该拥有 1500-2500 bootstrap 附近的统计数据。您尝试表征的分位数越远,您需要的 bootstrap 个样本就越多。因此,下面的代码生成一组 bootstrap 系数:
out <- t(replicate(1000, bsfun(data=tdat, blocks=blocks)))
根据这组 bootstrap 统计数据,我们可以得出一个置信区间。
ci1 <- t(apply(out, 2, quantile, probs=c(.025,.975), na.rm=TRUE))
# 2.5% 97.5%
# (Intercept) -0.3302237 10.258229
# x1 -1.7577214 2.301975
# x2 -0.8016478 2.049435
# x3 -3.0723869 6.190383
如果你想调查这些区间的覆盖概率,你将不得不做我上面做的很多次(我们会做 100 次,尽管为了得到更好的估计,你可能想做更多).然后我们可以编写一个小函数来评估一组估计的覆盖率:
eval_cover <- function(true = c(5,.25,.4, .8), obs){
out <- as.numeric(obs[,1] < true & obs[,2] > true)
names(out) <- rownames(obs)
out
}
然后,您可以将该函数应用于您生成的每个 bootstrap 置信区间。使用 rowMeans()
函数将获得覆盖 1/0 值的平均值,这将是覆盖概率。在这种情况下,仅使用 100 个间隔,覆盖率为 100%。
rowMeans(sapply(outci, function(x)eval_cover(obs=x)))
# (Intercept) x1 x2 x3
# 1 1 1 1