Regression tree analysis: generating split confidence for specific splits
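I am trying to generate bootstrapped confidence 'intervals' for specific splits of a regression tree, using rpart (to generate the tree) and boot (to generate the bootstraps), as detailed in this question/answer.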
Example:
data(iris)
library(rpart)
r1<-rpart(Sepal.Length ~ ., cp = 0.05, data=iris)
plot(r1)
text(r1)
library(boot)
trainData <- iris[-150L, ]
predictData <- iris[150L, ]
rboot <- boot(trainData, function(data, idx) {
  bootstrapData <- data[idx, ]
  r1 <- rpart(Sepal.Length ~ ., bootstrapData, cp = 0.05)
  predict(r1, newdata = predictData)
}, 1000L)
Generate quantiles, since rpart has no CI function:
quantile(rboot$t, c(0.025, 0.975))
2.5% 97.5%
5.871393 6.766842
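As a cross-check on the manual quantile() call, the boot package's own boot.ci() can compute a percentile interval from the same kind of object. This is only a minimal sketch: the fixed seed, the smaller number of replicates, and the name predict_stat are assumptions made here to keep it quick and reproducible. Its endpoints may differ slightly from quantile(), because the two functions interpolate order statistics differently.

```r
library(rpart)
library(boot)

set.seed(1)  # assumption: seed fixed only for reproducibility
trainData <- iris[-150L, ]
predictData <- iris[150L, ]

# statistic: refit the tree on the resampled rows, then predict the held-out row
predict_stat <- function(data, idx) {
  fit <- rpart(Sepal.Length ~ ., data = data[idx, ], cp = 0.05)
  unname(predict(fit, newdata = predictData))
}

rboot <- boot(trainData, predict_stat, R = 500L)

# percentile bootstrap interval for the predicted Sepal.Length
ci <- boot.ci(rboot, conf = 0.95, type = "perc")
ci$percent[4:5]  # lower and upper limits
```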
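That's fine, but how can I get 'quantile' estimates for each split based on the predictor variable, for example "Petal.Length<3.4", i.e. quantiles on either side of the split?
Here is a solution (provided by a non-member). The only problem is that the number of splits fluctuates between bootstrap runs, probably a function of small n.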
Fit the model:
library(boot)
library(rpart)
library(lattice)
data(iris)
names(iris)
iris2 <- iris[,c(1,3)]
r1 <- rpart(Sepal.Length ~ Petal.Length, cp = 0.05, data=iris2)
r1$splits
r1$frame
Plot the tree:
plot(r1)
text(r1)
Bootstrap manually:
n.boot <- 10000
Enter the number of splits to look at:
n.split <- 3 #change this according to no. of splits on tree
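store_matrix <- matrix(NA_real_, nrow = n.boot, ncol = n.split) #column 1 will contain split 1, col 2 split 2, etc
trainData <- iris2
for (i in seq_len(n.boot)) {
  iboot <- sample(nrow(trainData), replace = TRUE)
  bootdata <- trainData[iboot, ]
  r <- rpart(Sepal.Length ~ Petal.Length, bootdata, cp = 0.05)
  #a bootstrap tree can have fewer than n.split splits; leave the missing ones as NA
  k <- min(n.split, NROW(r$splits))
  #column 4 of r$splits is the split point
  if (k > 0) store_matrix[i, seq_len(k)] <- r$splits[seq_len(k), 4]
}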
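Since the number of splits varies between bootstrap trees, it can help to check how often a tree actually has n.split splits before indexing into r$splits. A minimal sketch, assuming a hypothetical helper name count_splits, a fixed seed, and a reduced number of replicates for speed:

```r
library(rpart)

set.seed(1)  # assumption: seed fixed only for reproducibility
iris2 <- iris[, c("Sepal.Length", "Petal.Length")]

# hypothetical helper: number of rows in the fitted tree's split table
# (NROW(NULL) is 0, which covers a tree with no splits at all)
count_splits <- function(data) {
  fit <- rpart(Sepal.Length ~ Petal.Length, data = data, cp = 0.05)
  NROW(fit$splits)
}

n.check <- 200
n_splits <- replicate(n.check, {
  idx <- sample(nrow(iris2), replace = TRUE)
  count_splits(iris2[idx, ])
})

table(n_splits)  # distribution of split counts across bootstrap trees
```

If the table shows many trees with fewer splits than n.split, the later columns of the stored matrix describe only a subset of the bootstrap runs.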
Generate the interval and its width:
split.n <- 2 #choose which split to look at
store1 <- store_matrix[,split.n] #select the distribution of split estimates for a specific split
(split_estimate <- r1$splits[split.n,4]) #check it's the correct split
[1] 3.4
q1 <- quantile(na.omit(as.numeric(store1)), c(0.025, 0.975))
quantile(na.omit(as.numeric(store1)), c(0.025, 0.975)); as.numeric(q1)[2] - as.numeric(q1)[1]
2.5% 97.5%
1.45 5.65
[1] 4.2
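The interval may be wide partly because "split 2" need not be the same node in every bootstrap tree: rows of r$splits are selected by position, not by where the split sits in the tree. One possible refinement is to label each split by its rpart node number and collect only the splits found at a chosen node. The sketch below assumes a single predictor, so that r$splits has exactly one row per internal node, in the same order as the non-leaf rows of r$frame. The helper name splits_by_node, the fixed seed, and the reduced replicate count are illustrative assumptions, not part of the original solution.

```r
library(rpart)

set.seed(1)  # assumption: seed fixed only for reproducibility
iris2 <- iris[, c("Sepal.Length", "Petal.Length")]

# hypothetical helper: split points named by rpart node number
# (assumes one predictor, hence no competitor or surrogate rows in $splits)
splits_by_node <- function(fit) {
  if (is.null(fit$splits)) return(numeric(0))
  internal <- fit$frame$var != "<leaf>"
  stopifnot(nrow(fit$splits) == sum(internal))
  setNames(fit$splits[, "index"], rownames(fit$frame)[internal])
}

node <- "1"    # root node; "2" and "3" are its left and right children
n.boot <- 500
boot_splits <- rep(NA_real_, n.boot)
for (i in seq_len(n.boot)) {
  idx <- sample(nrow(iris2), replace = TRUE)
  fit <- rpart(Sepal.Length ~ Petal.Length, data = iris2[idx, ], cp = 0.05)
  s <- splits_by_node(fit)
  if (node %in% names(s)) boot_splits[i] <- s[[node]]
}

mean(!is.na(boot_splits))                             # share of trees with a split at this node
quantile(boot_splits, c(0.025, 0.975), na.rm = TRUE)  # percentile interval for that split point
```

Changing node to "2" or "3" looks at the splits below the root; the share of non-missing values then shows how often a bootstrap tree splits at that position at all.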