Interpreting Random Forest Model Results
Many thanks for the feedback on interpreting my RF model and on how to evaluate its results in general.
```
57658 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
After several adjustments to the functional form of my Y variable and to the way I split the data, I got the results below. My ROC improved slightly, but interestingly, my Sens & Spec changed drastically compared to my initial model:
```
35000 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens          Spec
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
This time I split the data randomly rather than by time, and experimented with several values of mtry using the following code:
```{r Cross Validation Part 1}
library(caret)  # for createFolds()

set.seed(1992)  # seed for reproducibility
folds <- createFolds(train_data$left_welfare, k = 5)  # partition the data into 5 equal folds

# tuning grid; note "variance" is ranger's regression splitrule,
# which is why those rows come back as NaN for a two-class outcome
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20),
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(1, 5, 10))

sapply(folds, length)  # check the fold sizes
```
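The `train()` call itself isn't shown in the chunk above; presumably the folds and grid were wired into `caret::train()` roughly like this (an assumed sketch, using `method = "ranger"` and `twoClassSummary`, since the output below reports ROC/Sens/Spec):

```r
library(caret)
library(ranger)

# trainControl's `index` expects the *training* rows of each resample,
# so the folds are rebuilt here with returnTrain = TRUE
train_folds <- createFolds(train_data$left_welfare, k = 5, returnTrain = TRUE)

ctrl <- trainControl(method = "cv",
                     index = train_folds,
                     classProbs = TRUE,  # required for ROC-based metrics
                     summaryFunction = twoClassSummary)

rf_fit <- train(left_welfare ~ ., data = train_data,
                method = "ranger",
                metric = "ROC",
                trControl = ctrl,
                tuneGrid = tune_mtry)
rf_fit
```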
Running it gave the following results:
```
Random Forest

84172 samples
   14 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    variance    0.5000000        NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000        NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000        NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000        NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000        NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000        NaN        NaN
  15    extratrees  0.5000000        NaN        NaN

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.
```
It looks like your random forest has almost no predictive power over the second class, "left".
The best scores all combine extremely high sensitivity with very low specificity, which basically means your classifier just labels everything as "stayed", which I assume is the majority class. Unfortunately, that is quite bad, since it does little more than the naive classifier that assigns everything to the first class.
Also, I can't quite tell whether you only tried mtry values of 2, 14 and 27, but if so I would strongly suggest trying the whole 3-25 range (the optimum is most likely somewhere in the middle).
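In caret terms that just means widening the tuning grid; a minimal sketch, reusing the splitrule and min.node.size values from the question:

```r
library(caret)

# scan the whole mtry range rather than only a few isolated values
tune_grid <- expand.grid(mtry = 3:25,
                         splitrule = "gini",
                         min.node.size = 1)
```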
Beyond that, since performance looks rather poor (judging by the ROC), I would suggest putting more work into feature engineering to extract more information from your data. Otherwise, if you are happy with what you have, or think nothing more can be extracted, just tune the probability threshold of the classifier so that the sensitivity and specificity reflect your requirements for the two classes (you may care more about misclassifying "stayed" than "left", or vice versa; I don't know your problem).
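A minimal sketch of that threshold tuning, assuming a fitted caret model `rf_fit` and hold-out data `test_data` with outcome `left_welfare` (all hypothetical names, not shown in the original):

```r
library(caret)

# class probabilities from the fitted model
probs <- predict(rf_fit, newdata = test_data, type = "prob")

# label "left" whenever its probability clears a custom threshold;
# 0.3 is purely illustrative - in practice pick it from the ROC curve
# according to your misclassification cost trade-off
threshold <- 0.3
pred <- factor(ifelse(probs[["left"]] > threshold, "left", "stayed"),
               levels = c("stayed", "left"))

confusionMatrix(pred, test_data$left_welfare)
```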
Hope that helps!