Interpreting Random Forest Model Results
Many thanks for the feedback on interpreting my RF model and on how to evaluate its results in general.
```
57658 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 11531, 11531, 11532, 11532, 11532
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    gini        0.6273579  0.9999011  0.0006250729
   2    extratrees  0.6246980  0.9999197  0.0005667791
  14    gini        0.5968382  0.9324610  0.1116113149
  14    extratrees  0.6192781  0.9740323  0.0523004026
  27    gini        0.5584677  0.7546156  0.2977507092
  27    extratrees  0.5589923  0.7635036  0.2905489827

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
After several adjustments to the functional form of my Y variable and to the way I split the data, I got the results below. My ROC improved slightly, but interestingly, my Sens & Spec changed drastically compared to my initial model:
```
35000 samples
   27 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 7000, 7000, 7000, 7000, 7000
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens          Spec
   2    gini        0.6351733  0.0004618204  0.9998685
   2    extratrees  0.6287926  0.0000000000  0.9999899
  14    gini        0.6032979  0.1346653886  0.9170874
  14    extratrees  0.6235212  0.0753069696  0.9631711
  27    gini        0.5725621  0.3016414054  0.7575899
  27    extratrees  0.5716616  0.2998190728  0.7636219

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 1.
```
This time I split the data randomly rather than by time, and experimented with several values of mtry using the following code:
```{r Cross Validation Part 1}
library(caret)  # for createFolds()

set.seed(1992)  # seed for reproducibility
folds <- createFolds(train_data$left_welfare, k = 5)  # partition the data into 5 equal folds

# tuning grid; note "variance" is ranger's regression splitrule,
# which is why those rows come back as NaN for a two-class outcome
tune_mtry <- expand.grid(mtry = c(2, 10, 15, 20),
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(1, 5, 10))

sapply(folds, length)  # check the fold sizes
```
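The `train()` call itself isn't shown in the chunk above; presumably the folds and grid were wired into `caret::train()` roughly like this (an assumed sketch, using `method = "ranger"` and `twoClassSummary`, since the output below reports ROC/Sens/Spec):

```r
library(caret)
library(ranger)

# trainControl's `index` expects the *training* rows of each resample,
# so the folds are rebuilt here with returnTrain = TRUE
train_folds <- createFolds(train_data$left_welfare, k = 5, returnTrain = TRUE)

ctrl <- trainControl(method = "cv",
                     index = train_folds,
                     classProbs = TRUE,  # required for ROC-based metrics
                     summaryFunction = twoClassSummary)

rf_fit <- train(left_welfare ~ ., data = train_data,
                method = "ranger",
                metric = "ROC",
                trControl = ctrl,
                tuneGrid = tune_mtry)
rf_fit
```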
Running it gave the following results:
```
Random Forest

84172 samples
   14 predictor
    2 classes: 'stayed', 'left'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 16834, 16834, 16834, 16835, 16835
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec
   2    variance    0.5000000        NaN        NaN
   2    extratrees  0.7038724  0.3714761  0.8844723
   5    variance    0.5000000        NaN        NaN
   5    extratrees  0.7042525  0.3870192  0.8727755
   8    variance    0.5000000        NaN        NaN
   8    extratrees  0.7014818  0.4075797  0.8545012
  10    variance    0.5000000        NaN        NaN
  10    extratrees  0.6956536  0.4336180  0.8310368
  12    variance    0.5000000        NaN        NaN
  12    extratrees  0.6771292  0.4701687  0.7777730
  15    variance    0.5000000        NaN        NaN
  15    extratrees  0.5000000        NaN        NaN

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 5, splitrule = extratrees and min.node.size = 1.
```
It looks like your random forest has almost no predictive power over the second class, "left".
The best scores all combine extremely high sensitivity with very low specificity, which basically means your classifier just labels everything as "stayed", which I assume is the majority class. Unfortunately, that is quite bad, since it does little more than the naive classifier that assigns everything to the first class.
Also, I can't quite tell whether you only tried mtry values of 2, 14 and 27, but if so I would strongly suggest trying the whole 3-25 range (the optimum is most likely somewhere in the middle).
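In caret terms that just means widening the tuning grid; a minimal sketch, reusing the splitrule and min.node.size values from the question:

```r
library(caret)

# scan the whole mtry range rather than only a few isolated values
tune_grid <- expand.grid(mtry = 3:25,
                         splitrule = "gini",
                         min.node.size = 1)
```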
Beyond that, since performance looks rather poor (judging by the ROC), I would suggest putting more work into feature engineering to extract more information from your data. Otherwise, if you are happy with what you have, or think nothing more can be extracted, just tune the probability threshold of the classifier so that the sensitivity and specificity reflect your requirements for the two classes (you may care more about misclassifying "stayed" than "left", or vice versa; I don't know your problem).
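A minimal sketch of that threshold tuning, assuming a fitted caret model `rf_fit` and hold-out data `test_data` with outcome `left_welfare` (all hypothetical names, not shown in the original):

```r
library(caret)

# class probabilities from the fitted model
probs <- predict(rf_fit, newdata = test_data, type = "prob")

# label "left" whenever its probability clears a custom threshold;
# 0.3 is purely illustrative - in practice pick it from the ROC curve
# according to your misclassification cost trade-off
threshold <- 0.3
pred <- factor(ifelse(probs[["left"]] > threshold, "left", "stayed"),
               levels = c("stayed", "left"))

confusionMatrix(pred, test_data$left_welfare)
```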
Hope that helps!