Difference between Importance(random forest) and RandomForest$importance
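I don't understand the difference between the importance() function (randomForest package) and the importance value stored in a random forest model:
I fitted a simple RF classification model and tried to get the variable importance with the following code: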
rf_model$importance
             0           1 MeanDecreaseAccuracy MeanDecreaseGini
X1 0.096886458 0.032546101          0.055488009      2472.172207
X2 0.030985037 0.025615202          0.027530078      1338.378297
X3 0.124302743 0.012551971          0.052402188      3091.891586
importance(rf_model)
             0           1 MeanDecreaseAccuracy MeanDecreaseGini
X1 159.9149603 175.6265625           242.424683      2472.172207
X2 104.8273654 97.09338154          129.5084398      1338.378297
X3 157.0207876 86.93847182          216.6374153      3091.891586
Why do the first three columns differ between the two outputs while MeanDecreaseGini is the same?
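By default, importance(rf_model) divides the permutation-based measures (the per-class columns and MeanDecreaseAccuracy) by their "standard errors", whereas rf_model$importance stores the raw, unscaled values. MeanDecreaseGini is never scaled, which is why that column is identical in both outputs. Consider this example: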
library(randomForest)
set.seed(4543)
data(mtcars)
# importance = TRUE is needed so that the permutation-based measure is computed
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, ntree = 1000,
                          keep.forest = FALSE, importance = TRUE)
mtcars.rf$importance
#output
        %IncMSE IncNodePurity
cyl   7.3939431     162.38777
disp 10.0468306     257.46627
hp    7.6801388     200.22729
drat  1.0921653      65.96165
wt    9.7998328     250.94940
qsec  0.6066792      38.52055
vs    0.7048540      24.75183
am    0.6201962      17.27180
gear  0.4110634      16.33811
carb  1.0549523      27.47096
This is the same as:
importance(mtcars.rf, scale = FALSE)
        %IncMSE IncNodePurity
cyl   7.3939431     162.38777
disp 10.0468306     257.46627
hp    7.6801388     200.22729
drat  1.0921653      65.96165
wt    9.7998328     250.94940
qsec  0.6066792      38.52055
vs    0.7048540      24.75183
am    0.6201962      17.27180
gear  0.4110634      16.33811
carb  1.0549523      27.47096
whereas the default (scale = TRUE) gives:
importance(mtcars.rf)
       %IncMSE IncNodePurity
cyl  15.767986     162.38777
disp 19.885128     257.46627
hp   18.177916     200.22729
drat  7.002942      65.96165
wt   18.479239     250.94940
qsec  5.022593      38.52055
vs    4.427525      24.75183
am    6.435329      17.27180
gear  3.968845      16.33811
carb  8.207903      27.47096
Finally:
importance(mtcars.rf, scale = FALSE)[,1]/mtcars.rf$importanceSD
      cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
15.767986 19.885128 18.177916  7.002942 18.479239  5.022593  4.427525  6.435329  3.968845  8.207903
is equal to importance(mtcars.rf)[,1]:
all.equal(importance(mtcars.rf, scale = FALSE)[,1]/mtcars.rf$importanceSD,
importance(mtcars.rf)[,1])
#output
TRUE
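The same check can be sketched for a classification forest, which is the case in the question. The snippet below is only an illustration: it uses the built-in iris data rather than the asker's data, and the object names (iris.rf, raw, sds, manual) are made up for the example. It assumes that importanceSD holds one column per class plus MeanDecreaseAccuracy, and that a standard error of zero is replaced by 1 before dividing.

```r
library(randomForest)

set.seed(4543)
# importance = TRUE is required for the permutation-based measures
iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                        importance = TRUE)

raw <- iris.rf$importance    # unscaled: class columns, MeanDecreaseAccuracy, MeanDecreaseGini
sds <- iris.rf$importanceSD  # "standard errors" for all columns except MeanDecreaseGini

# Scale every column except the last one (MeanDecreaseGini) by hand.
# Assumption: near-zero standard errors are treated as 1 to avoid dividing by zero.
manual <- raw
k <- ncol(raw)
manual[, -k] <- raw[, -k] / ifelse(sds < .Machine$double.eps, 1, sds)

# Compare the hand-scaled matrix with the default (scaled) output,
# and the stored matrix with the unscaled output
all.equal(manual, importance(iris.rf))
all.equal(raw, importance(iris.rf, scale = FALSE))
```

If the scaling works as described above, both comparisons should return TRUE, and the MeanDecreaseGini column is untouched in either case.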