Why is the error rate from bagging trees much higher than that from a single tree?
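I cross-posted this question elsewhere, but it seems unlikely that I will get an answer there, so I am posting it here as well.
I am running the classification method Bagging Tree (Bootstrap Aggregation) and comparing its misclassification error rate with that of a single tree. We expect bagging to do better than a single tree, i.e. the bagging error rate should be lower.
I repeat the whole procedure M = 100 times (each time randomly splitting the original data set into a training set and a test set) to obtain 100 single-tree test errors and 100 bagging test errors (using a for loop). I then use boxplots to compare the distributions of the two errors.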
# Loading package and data
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
# Initialization
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
single.tree.error <- vector(length = M)
bagging.error <- vector(length = M)
# Define statistic
estim.pred <- function(a.sample, vector.of.indices)
{
    current.train <- a.sample[vector.of.indices, ]
    current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
    predict(current.fitted.model, test.set, type = "class")
}
for (j in 1:M)
{
    # Split the data into test/train sets
    train.idx <- sample(1:n, ntrain, replace = FALSE)
    train.set <- PimaIndiansDiabetes[train.idx, ]
    test.set <- PimaIndiansDiabetes[-train.idx, ]
    # Train a direct tree model
    fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
    pred.test <- predict(fitted.tree, test.set, type = "class")
    single.tree.error[j] <- mean(pred.test != test.set$diabetes)
    # Bootstrap estimates
    res.boot <- boot(train.set, estim.pred, B)
    pred.boot <- vector(length = ntest)
    for (i in 1:ntest)
    {
        pred.boot[i] <- ifelse(mean(res.boot$t[, i] == "pos") >= 0.5, "pos", "neg")
    }
    bagging.error[j] <- mean(pred.boot != test.set$diabetes)
}
boxplot(single.tree.error, bagging.error, ylab = "Misclassification errors", names = c("single.tree", "bagging"))
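In the resulting boxplot, the bagging error rate is much higher than the single-tree error rate.
Can you explain why the error rate of the bagged trees is so much higher than that of a single tree? This does not make sense to me. I have checked my code but could not find anything unusual.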
I received an answer at https://stats.stackexchange.com/questions/452882/why-is-the-error-rate-from-bagging-trees-much-higher-than-that-from-a-single-tre. I am posting it here to close this question and for future visitors.
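A likely cause, as far as I can tell, is the way the predictions are stored rather than bagging itself. The sketch below assumes that `boot()` collects each replicate into a plain matrix (`res.boot$t`), so the factor returned by `predict(..., type = "class")` is reduced to its integer codes. The toy `pred` vector and the names `t.star`, `vote.as.posted`, `vote.fixed` and `pos.code` are made up for illustration and use only base R.

```r
# Toy stand-in for the class predictions of one tree (hypothetical values).
pred <- factor(c("neg", "pos", "pos"), levels = c("neg", "pos"))

# Assumption: boot() fills a plain matrix row by row, roughly like this.
B <- 5
t.star <- matrix(NA, nrow = B, ncol = length(pred))
for (r in seq_len(B)) t.star[r, ] <- pred

# The labels are lost; only the integer level codes are kept.
print(storage.mode(t.star))
print(t.star[1, ])

# Vote as written in the question: a level code never equals "pos",
# so every test observation ends up classified as "neg".
vote.as.posted <- ifelse(colMeans(t.star == "pos") >= 0.5, "pos", "neg")

# One possible fix: compare with the integer code of the "pos" level.
pos.code <- which(levels(pred) == "pos")
vote.fixed <- ifelse(colMeans(t.star == pos.code) >= 0.5, "pos", "neg")

print(vote.as.posted)
print(vote.fixed)
```

If this is what happens in the posted code, the bagged classifier always predicts "neg", and its error rate is simply the share of "pos" cases in the test set. Under that assumption, replacing `res.boot$t[, i] == "pos"` with a comparison against the code of the "pos" level (2, if the levels of `diabetes` are ordered "neg", "pos"), or having `estim.pred` return `as.integer(pred == "pos")` for its prediction vector `pred`, should restore the intended majority vote.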