Random forest regression - cumulative MSE?

Question

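I am new to random forests and I have a question about regression. I am using the R package randomForest to fit RF models.

My final goal is to select sets of variables that are important for predicting a continuous trait, so I fit a model, remove the variable with the lowest mean decrease in accuracy, fit a new model, and so on. This worked for RF classification, where I compared the models using the OOB errors from the prediction (training set), development and validation data sets. Now, with regression, I want to compare the models based on % variation explained and MSE.

I was evaluating the results for MSE and % var explained, and I get exactly the same results when I calculate them by hand from the predictions in model$predicted. But when I do model$mse, the value presented corresponds to the value of MSE for the last tree calculated, and the same happens for % var explained.

As an example, you can try this code in R: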

library(randomForest)
data("iris")
head(iris)

TrainingX<-iris[1:100,2:4] #creating training set - X matrix
TrainingY<-iris[1:100,1]  #creating training set - Y vector

TestingX<-iris[101:150,2:4]  #creating test set - X matrix
TestingY<-iris[101:150,1]  #creating test set - Y vector

set.seed(2)

model<-randomForest(x=TrainingX, y= TrainingY, ntree=500, #calculating model
                    xtest = TestingX, ytest = TestingY)

#for prediction (training set)

pred<-model$predicted

meanY<-sum(TrainingY)/length(TrainingY)

varpY<-sum((TrainingY-meanY)^2)/length(TrainingY)

mseY<-sum((TrainingY-pred)^2)/length(TrainingY)

r2<-(1-(mseY/varpY))*100

#for testing (test set)

pred_2<-model$test$predicted

meanY_2<-sum(TestingY)/length(TestingY)

varpY_2<-sum((TestingY-meanY_2)^2)/length(TestingY)

mseY_2<-sum((TestingY-pred_2)^2)/length(TestingY)

r2_2<-(1-(mseY_2/varpY_2))*100

training_set_mse<-c(model$mse[500], mseY)
training_set_rsq<-c(model$rsq[500]*100, r2)
testing_set_mse<-c(model$test$mse[500],mseY_2)
testing_set_rsq<-c(model$test$rsq[500]*100, r2_2)

# named "results" rather than "c" so the base function c() is not shadowed
results<-cbind(training_set_mse,training_set_rsq,testing_set_mse, testing_set_rsq)
rownames(results)<-c("last tree", "by hand")
results
model

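After running this code you will get a table containing the values of MSE and % var explained (also called rsq). The first row is called "last tree" and contains the values of MSE and % var explained for the 500th tree in the forest. The second row is called "by hand" and contains the results calculated in R from the vectors model$predicted and model$test$predicted.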

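So, my questions are:

1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)

2- Is the last tree to be considered as an average of all the others?

3- Why are the MSE and % var explained of the RF model (the ones shown in the main summary when you call model) the same as those of the 500th tree (see the first row of the table)? Do the vectors model$mse or model$rsq contain cumulative values?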

After the last edit, I found this post from Andy Liaw (one of the creators of the package), which says that MSE and % var explained are in fact cumulative!: https://stat.ethz.ch/pipermail/r-help/2004-April/049943.html.

Answer 1

Not sure I understand what your issue is; I'll give it a try nevertheless...

1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)

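You thought correctly; the trees are fitted independently of each other, so their predictions are indeed independent. In fact, this is a key advantage of RF models, since it allows for parallel implementations.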

2- Is the last tree to be considered as an average of all the others?

No; as explained above, all trees are independent.
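To make the distinction concrete, here is a minimal sketch (not part of the original thread). It refits a forest on the same iris split as in the question and assumes that predict() with predict.all = TRUE returns the per-tree predictions in $individual and the forest prediction in $aggregate. The names all_pred, avg_equal and last_tree_differs are only illustrative:

```r
library(randomForest)
data("iris")

# same split as in the question
TrainingX <- iris[1:100, 2:4]
TrainingY <- iris[1:100, 1]
TestingX  <- iris[101:150, 2:4]

set.seed(2)
model <- randomForest(x = TrainingX, y = TrainingY, ntree = 500,
                      keep.forest = TRUE)  # keep the trees so predict() works

# per-tree predictions: one row per test case, one column per tree
all_pred <- predict(model, newdata = TestingX, predict.all = TRUE)
dim(all_pred$individual)

# the forest prediction should be the plain average over the trees
avg_equal <- all.equal(unname(all_pred$aggregate),
                       unname(rowMeans(all_pred$individual)))
avg_equal

# the last tree is just one more tree, not an average of the others
last_tree_differs <- !isTRUE(all.equal(unname(all_pred$individual[, 500]),
                                       unname(all_pred$aggregate)))
last_tree_differs
```

So the averaging happens at the forest level, across the independently grown trees; no single tree holds it.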

3- If each tree gets a prediction, how can I get the matrix with all the trees, since what I need is the MSE and % var explained for the forest?

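Given your code above, here is where your question starts getting really unclear; the MSE and r2 you say you need are exactly what you have already computed in mseY and r2: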

mseY
[1] 0.1232342

r2
[1] 81.90718

Unsurprisingly, these are exactly the same values as those reported by model:

model
# result:

Call:
 randomForest(x = TrainingX, y = TrainingY, ntree = 500) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 0.1232342
                    % Var explained: 81.91

So I'm not sure I can really see your issue, or what these values have to do with the "matrix with all the trees"...

But when I do model$mse, the value presented corresponds to the value of MSE for the last tree calculated, and the same happens for % var explained.

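~~It most certainly does not: model$mse is a vector of length equal to the number of trees (here 500), containing the MSE for each individual tree;~~ (see UPDATE below) I have never seen any use for it in practice (similarly for model$rsq):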

length(model$mse)
[1] 500

length(model$rsq)
[1] 500

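UPDATE: Kudos to the OP herself (see comments), who discovered that the quantities in model$mse and model$rsq are indeed cumulative (!); from an old (2004) thread by the package maintainer Andy Liaw, Extracting the MSE and % Variance from RandomForest: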

Several ways:

Read ?randomForest, especially the `Value' section.

Look at str(myforest.rf).

Look at print.randomForest.

If the forest has 100 trees, then the mse and rsq are vectors with 100 elements each, the i-th element being the mse (or rsq) of the forest consisting of the first i trees. So the last element is the mse (or rsq) of the whole forest.
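A rough way to check this on the test set is sketched below (not from the original thread). It assumes that the test-set prediction of "the forest consisting of the first i trees" is the plain average of the first i per-tree predictions, and that no bias correction is applied (corr.bias = FALSE, the default). keep.forest = TRUE is set explicitly, because the forest is not kept by default when xtest is supplied. The names tree_pred, cum_mse and whole_forest_mse are only illustrative:

```r
library(randomForest)
data("iris")

# same split as in the question
TrainingX <- iris[1:100, 2:4]
TrainingY <- iris[1:100, 1]
TestingX  <- iris[101:150, 2:4]
TestingY  <- iris[101:150, 1]

set.seed(2)
model <- randomForest(x = TrainingX, y = TrainingY, ntree = 500,
                      xtest = TestingX, ytest = TestingY,
                      keep.forest = TRUE)

# per-tree predictions on the test set (rows = cases, columns = trees)
tree_pred <- predict(model, newdata = TestingX, predict.all = TRUE)$individual

# MSE of the sub-forest made of the first i trees, for i = 1..ntree
cum_mse <- sapply(seq_len(model$ntree), function(i) {
  forest_i <- rowMeans(tree_pred[, seq_len(i), drop = FALSE])
  mean((TestingY - forest_i)^2)
})

# under the assumptions above, this should agree with the stored vector
all.equal(cum_mse, model$test$mse)

# the last element is the MSE of the whole forest
whole_forest_mse <- mean((TestingY - model$test$predicted)^2)
all.equal(cum_mse[model$ntree], whole_forest_mse)
```

The OOB vector model$mse is cumulative in the same sense, but it is harder to rebuild by hand, because each training case is only predicted by the trees for which it was out-of-bag. Calling plot(model) draws these error curves against the number of trees, which is a quick way to see whether the forest has enough trees.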

