Coverage probability problem for moving block bootstrap

Question

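I am applying the moving block bootstrap (MBB) to a regression model fitted to time series data. When I compute the coverage probability of the MBB estimators, the results are far from the nominal level for every coefficient except one (the coefficient of x1, which is a continuous variable). Since MBB is a well-established method (see https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.713.1262&rep=rep1&type=pdf and https://en.wikipedia.org/wiki/Bootstrapping_(statistics)), I wonder whether something is wrong with my code. Any comments are appreciated!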

set.seed(63)
#create a function to generate time series data
tsfunc3 <- function (size=30, ar=0.7) {
  ar.epsilon <- arima.sim(list(order = c(1,0,0), ar = ar), n = size, sd=2) #use the ar argument instead of a hard-coded 0.7
  x1=rnorm(size)
  x2=sample(1:5, size, replace = TRUE, prob = c(0.2, 0.2, 0.2, 0.2, 0.2))
  x3=rbinom(size, 1, 0.5)
  y=as.numeric(5 + 0.25*x1 + 0.4*x2 + 0.8*x3 + ar.epsilon) #A combination of continuous 
                                                           #predictor x1, ordinal predictor
                                                           #x2 and binary predictor x3
  data.frame(time=1:size, x1=x1, x2=x2, x3=x3, y=y)}

#A time series
tdat <- tsfunc3()

# Block length derived from the data based on the approach proposed by Politis & White 
#(2003): 
b <- 3
#Initial values
#blocks=tdat[1:3,c(2,3,4,5)]
n <- 30
#A sequence of blocks
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))

#MBB for intercept estimator
IntMbb <- function() { 
  take.blocks <- sample(1:28,10,replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]),]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "(Intercept)", level = 0.95)
}

#MBB for x1 coefficient estimator
B1Mbb <- function() { 
  take.blocks <- sample(1:28,10,replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]),]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x1", level = 0.95)
}

#MBB for x2 coefficient estimator
B2Mbb <- function() { 
  take.blocks <- sample(1:28,10,replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]),]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x2", level = 0.95)
}

#MBB for x3 coefficient estimator
B3Mbb <- function() { 
  take.blocks <- sample(1:28,10,replace=TRUE)
  newdat <- tdat[unlist(blocks[take.blocks]),]
  x1 <- unlist(newdat["x1"])
  x2 <- unlist(newdat["x2"])
  x3 <- unlist(newdat["x3"])
  y <- unlist(newdat["y"])
  regmbb <- lm(y ~ x1 + x2 + x3)
  confint(regmbb, "x3", level = 0.95)
}

#Replications
set.seed(47) 
R <- 100
int.mbb <- replicate(R, IntMbb(), simplify=FALSE)
b1.mbb <- replicate(R, B1Mbb(), simplify=FALSE)
b2.mbb <- replicate(R, B2Mbb(), simplify=FALSE)
b3.mbb <- replicate(R, B3Mbb(), simplify=FALSE)

#Calculate coverage probability for intercept estimator
int.ci <- t(sapply(int.mbb, function(x, y) x[grep(y, rownames(x)), ], "Intercept"))
sum(int.ci[,"2.5 %"] <=5 & 5 <= int.ci[,"97.5 %"])/R
[1] 0.34

#Calculate coverage probability for x1 coefficient estimator
int.ci <- t(sapply(b1.mbb, function(x, y) x[grep(y, rownames(x)), ], "x1"))
sum(int.ci[,"2.5 %"] <=0.25 & 0.25 <= int.ci[,"97.5 %"])/R
[1] 0.9

#Calculate coverage probability for x2 coefficient estimator
int.ci <- t(sapply(b2.mbb, function(x, y) x[grep(y, rownames(x)), ], "x2"))
sum(int.ci[,"2.5 %"] <=0.4 & 0.4 <= int.ci[,"97.5 %"])/R
[1] 0.38

#Calculate coverage probability for x3 coefficient estimator
int.ci <- t(sapply(b3.mbb, function(x, y) x[grep(y, rownames(x)), ], "x3"))
sum(int.ci[,"2.5 %"] <=0.8 & 0.8 <= int.ci[,"97.5 %"])/R
[1] 0.33

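As you can see, only the coverage probability for the x1 coefficient estimator looks acceptable. Is something wrong with my code, or is this related to MBB itself?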

Answer 1

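You are not really evaluating the coverage probability of the bootstrap. You need to build the confidence interval from the bootstrapped statistics, not from a parametric model fitted to each bootstrapped sample. Here is how I would do it.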

First, we can generate the data:

set.seed(45301)
b <- 3
n <- 30
nblocks <- ceiling(n/b)
blocks <- lapply(seq_len(n-b+1), function(i) seq(i, i+b-1))

#A time series
tdat <- tsfunc3(size=n, ar=.7)

Next, we can write the function that we will bootstrap. It draws a bootstrap sample, runs the regression and returns the coefficients.

bsfun <- function(data, blocks){
  samp.data <- data[sample(1:length(blocks), length(blocks), replace=TRUE), ]
  mod <- lm(y ~ x1 + x2 + x3, data=samp.data)
  coef(mod)
}
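As written, bsfun() draws individual row indices from 1 to length(blocks), so the blocks themselves are never used. A minimal sketch of a variant that resamples whole blocks, as a moving block bootstrap would, might look like the following. The names bsfun_block, boot_coefs and ci_block and the nblocks argument are hypothetical and not part of the original answer; tsfunc3() is repeated from the question only so that the block runs on its own.

```r
set.seed(45301)

# Data generator repeated from the question so this block is self-contained.
tsfunc3 <- function(size = 30, ar = 0.7) {
  ar.epsilon <- arima.sim(list(order = c(1, 0, 0), ar = ar), n = size, sd = 2)
  x1 <- rnorm(size)
  x2 <- sample(1:5, size, replace = TRUE)
  x3 <- rbinom(size, 1, 0.5)
  y  <- as.numeric(5 + 0.25 * x1 + 0.4 * x2 + 0.8 * x3 + ar.epsilon)
  data.frame(time = 1:size, x1 = x1, x2 = x2, x3 = x3, y = y)
}

b <- 3
n <- 30
blocks <- lapply(seq_len(n - b + 1), function(i) seq(i, i + b - 1))
tdat <- tsfunc3(size = n, ar = 0.7)

# Hypothetical variant of bsfun(): draw whole blocks with replacement,
# concatenate their row indices, and trim to the original series length.
bsfun_block <- function(data, blocks,
                        nblocks = ceiling(nrow(data) / length(blocks[[1]]))) {
  take <- sample(seq_along(blocks), nblocks, replace = TRUE)  # block indices
  rows <- unlist(blocks[take])[seq_len(nrow(data))]           # row indices
  coef(lm(y ~ x1 + x2 + x3, data = data[rows, ]))
}

# Percentile intervals from the block-resampled coefficients.
boot_coefs <- t(replicate(1000, bsfun_block(tdat, blocks)))
ci_block <- t(apply(boot_coefs, 2, quantile, probs = c(.025, .975), na.rm = TRUE))
ci_block
```

With b = 3 and n = 30 this draws 10 blocks of 3 consecutive rows, so the short-range dependence inside each block is carried into the resample.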

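Next, we can run the function many times. Note that to produce reliable 95% percentile confidence intervals you should have somewhere around 1500-2500 bootstrap statistics. The further out in the tail the quantile you are trying to characterize, the more bootstrap samples you need. The code below generates a set of bootstrap coefficients: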

out <- t(replicate(1000, bsfun(data=tdat, blocks=blocks)))

From this set of bootstrap statistics, we can derive a confidence interval.

ci1 <- t(apply(out, 2, quantile, probs=c(.025,.975), na.rm=TRUE))
#                   2.5%     97.5%
# (Intercept) -0.3302237 10.258229
# x1          -1.7577214  2.301975
# x2          -0.8016478  2.049435
# x3          -3.0723869  6.190383
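The point about the number of bootstrap replicates can be checked with a small experiment. The sketch below is only an illustration under stated assumptions: it uses a hypothetical sample x and the sample mean rather than the regression above, and the helper upper_endpoint() is made up for this example. It repeats the whole bootstrap several times and compares how much the 97.5% endpoint moves from run to run when 200 versus 2000 replicates are used.

```r
set.seed(1)
x <- rnorm(30)   # hypothetical sample, not the tdat series

# Upper 97.5% percentile endpoint of the bootstrap distribution of the mean,
# computed from B bootstrap replicates.
upper_endpoint <- function(B) {
  boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
  unname(quantile(boot_means, probs = 0.975))
}

# Repeat the whole bootstrap 50 times for each B and compare the
# run-to-run spread of the endpoint.
sd_small <- sd(replicate(50, upper_endpoint(200)))
sd_large <- sd(replicate(50, upper_endpoint(2000)))
c(B200 = sd_small, B2000 = sd_large)
```

The spread of the endpoint should shrink as B grows, which is why a tail quantile calls for more replicates than a central summary such as the mean.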

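If you want to investigate the coverage probability of these intervals, you have to repeat what I did above many times (we will do it 100 times, although you may want more to get a better estimate). Then we can write a small function that evaluates the coverage of one set of intervals: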

eval_cover <- function(true = c(5,.25,.4, .8), obs){
  out <- as.numeric(obs[,1] < true & obs[,2] > true)
  names(out) <- rownames(obs)
  out 
}
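The list outci used in the next step is not defined in the code shown. A minimal sketch of how it could be built follows, assuming that each replication simulates a fresh series with tsfunc3() and then repeats the percentile-interval steps above. The helper one_ci() and the values of n_rep and B are hypothetical; B is kept small here only so the sketch runs quickly. tsfunc3() and bsfun() are copied from the question and the answer so that the block runs on its own.

```r
set.seed(45301)
b <- 3
n <- 30
blocks <- lapply(seq_len(n - b + 1), function(i) seq(i, i + b - 1))

# Data generator repeated from the question.
tsfunc3 <- function(size = 30, ar = 0.7) {
  ar.epsilon <- arima.sim(list(order = c(1, 0, 0), ar = ar), n = size, sd = 2)
  x1 <- rnorm(size)
  x2 <- sample(1:5, size, replace = TRUE)
  x3 <- rbinom(size, 1, 0.5)
  y  <- as.numeric(5 + 0.25 * x1 + 0.4 * x2 + 0.8 * x3 + ar.epsilon)
  data.frame(time = 1:size, x1 = x1, x2 = x2, x3 = x3, y = y)
}

# Bootstrap function copied from the answer.
bsfun <- function(data, blocks) {
  samp.data <- data[sample(1:length(blocks), length(blocks), replace = TRUE), ]
  coef(lm(y ~ x1 + x2 + x3, data = samp.data))
}

# Hypothetical helper: simulate one series and return its 4 x 2 matrix
# of 95% percentile intervals (one row per coefficient).
one_ci <- function(B = 200) {
  dat <- tsfunc3(size = n, ar = 0.7)
  out <- t(replicate(B, bsfun(data = dat, blocks = blocks)))
  t(apply(out, 2, quantile, probs = c(.025, .975), na.rm = TRUE))
}

n_rep <- 100   # number of simulated series, i.e. number of intervals
outci <- replicate(n_rep, one_ci(), simplify = FALSE)
```

Each element of outci has the same layout as ci1, so eval_cover() can be applied to it directly.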

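You can then apply that function to each bootstrap confidence interval you generated (stored here in the list outci). rowMeans() averages the 1/0 coverage indicators, which gives the coverage probability. In this case, with only 100 intervals, the coverage is 100%.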

rowMeans(sapply(outci, function(x)eval_cover(obs=x)))
# (Intercept)          x1          x2          x3 
#           1           1           1           1

Tags: simulation, regression, r, time-series