How to calculate class weights for Random forests

Question

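I have datasets for 2 classes on which I have to perform binary classification. I chose Random Forest as the classifier because it gave me the best accuracy among the models I tried. Dataset-1 has 462 data points and dataset-2 has 735 data points. I noticed that my data has a minor class imbalance, so I tried to optimise my training model and retrained it by providing class weights. I provided the following values for the class weights.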

cwt <- c(0.385,0.614) # Class weights
ss <- c(300,300) # Sample size

I used the following code to train the model:

tr_forest <- randomForest(output ~., data = train,
          ntree=nt, mtry=mt,importance=TRUE, proximity=TRUE,
          maxnodes=mn,sampsize=ss,classwt=cwt,
          keep.forest=TRUE,oob.prox=TRUE,oob.times= oobt,
          replace=TRUE,nodesize=ns, do.trace=1
          )

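Using the chosen class weights improved the accuracy of my model, but I am still doubtful whether my approach is correct or whether this is just a coincidence. How can I make sure my class weight choice is perfect?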

I calculated the class weights using the following formula:

Class weight for positive class = (No. of datapoints in dataset-1)/(Total datapoints)

Class weight for negative class = (No. of datapoints in dataset-2)/(Total datapoints)
 For dataset-1 462/1197 = 0.385
 For dataset-2 735/1197 = 0.614
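
The arithmetic above can be checked with a short sketch. Python is used here purely for illustration (the question itself uses R), and the variable names are hypothetical.

```python
# Class counts as stated in the question.
counts = {"dataset-1": 462, "dataset-2": 735}
total = sum(counts.values())  # 1197

# Weight of each class = its share of all data points (an empirical prior).
priors = {name: n / total for name, n in counts.items()}

for name, p in priors.items():
    print(f"{name}: {p:.4f}")
# dataset-1: 0.3860
# dataset-2: 0.6140
```

Note that 462/1197 rounds to 0.386; the 0.385 quoted above is the truncated value.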

Is this an acceptable method? If not, why does it improve the accuracy of my model? Please help me understand the nuances of class weights.

Answer 1

How can I make sure my class weight choice is perfect?

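Well, you certainly cannot. Perfect is absolutely the wrong word here; we are looking for useful heuristics that both improve performance and make sense (i.e. they don't feel like magic).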

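Given that, we do have an independent way of cross-checking your choice (which indeed looks sound), albeit in Python rather than R: the scikit-learn function compute_class_weight. We don't even need the exact data, only the number of samples in each class, which you have already provided: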

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_1 = np.ones(462)     # dataset-1
y_2 = np.ones(735) + 1 # dataset-2
y = np.concatenate([y_1, y_2])
len(y)
# 1197

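classes = np.array([1, 2])
# recent scikit-learn versions accept these arguments by keyword only
cw = compute_class_weight(class_weight='balanced', classes=classes, y=y)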
cw
# array([ 1.29545455,  0.81428571])

In fact, these are your numbers multiplied by ~2.11, i.e.:

cw/2.11
# array([ 0.6139595,  0.3859174])

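This looks good (multiplying by a constant does not affect the outcome), save for one detail: scikit-learn seems to advise using your numbers transposed, i.e. a weight of 0.614 for class 1 and 0.386 for class 2, rather than the other way round as in your computation.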
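
To make the "multiplied by ~2.11" observation concrete, here is a minimal sketch of the formula that scikit-learn documents for the 'balanced' mode, n_samples / (n_classes * count), written in plain Python so that it does not depend on any library version. The helper name balanced_weights is hypothetical.

```python
def balanced_weights(counts):
    """Return n_samples / (n_classes * count) for each class count."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * c) for c in counts]

counts = [462, 735]  # class 1 (dataset-1), class 2 (dataset-2)
cw = balanced_weights(counts)
print([round(w, 4) for w in cw])          # [1.2955, 0.8143]

# Rescaling the weights to sum to 1 gives the class proportions, swapped.
scale = sum(cw)                           # about 2.11
print([round(w / scale, 4) for w in cw])  # [0.614, 0.386]
```

With two classes, the scaling constant is simply the sum of the two balanced weights, which is why the rescaled values coincide with the proportions from the question in reverse order.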

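We have just entered the subtleties of the exact definition of what a class weight actually is, which is not necessarily the same across frameworks and libraries. scikit-learn uses these weights to weight the misclassification cost differently, so it makes sense to assign a greater weight to the minority class; this was the very idea in a draft paper by Breiman (inventor of RF) and Andy Liaw (maintainer of the randomForest R package):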

We assign a weight to each class, with the minority class given larger weight (i.e., higher misclassification cost).

Nevertheless, this is not what the classwt argument of the randomForest R function appears to be; from the docs:

classwt Priors of the classes. Need not add up to one. Ignored for regression.

"Priors of the classes" is in fact the analogue of class presence, i.e. exactly what you have computed here; this usage seems to be the consensus of a related (and highly voted) SO thread, What does the parameter 'classwt' in RandomForest function in RandomForest package in R stand for?; additionally, Andy Liaw himself has stated (emphasis mine):

The current "classwt" option in the randomForest package [...] is different from how the official Fortran code (version 4 and later) implements class weights.

I guess that the official Fortran implementation is as described in the draft paper quoted earlier (i.e. scikit-learn-like).

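I used RF for imbalanced data myself in my MSc thesis about 6 years ago and, as far as I can remember, I found the sampsize parameter much more useful than classwt, against which Andy Liaw (again...) has advised (emphasis mine):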

Search in the R-help archive to see other options and why you probably shouldn't use classwt.

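Moreover, in an already rather "dark" context regarding detailed explanations, it is not at all clear what exactly the effect is of using both the sampsize and classwt arguments together, as you have done here...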
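
For intuition about the sampsize half of that combination: the randomForest documentation describes a vector-valued sampsize as stratified sampling, with each element giving the number of cases drawn from the corresponding stratum. The sketch below imitates one such draw in plain Python. It assumes the full 1197 points are available (the question draws from a training split whose size is not given) and samples with replacement, as with replace=TRUE. It illustrates the idea only and is not the package's actual code; stratified_draw is a hypothetical helper.

```python
import random
from collections import Counter

def stratified_draw(labels, sizes, rng):
    """Draw sizes[c] row indices with replacement from each class c."""
    drawn = []
    for cls, size in sizes.items():
        members = [i for i, y in enumerate(labels) if y == cls]
        drawn.extend(rng.choices(members, k=size))
    return drawn

labels = [1] * 462 + [2] * 735   # class 1 is the minority class
rng = random.Random(0)           # fixed seed so the draw is repeatable
sample = stratified_draw(labels, {1: 300, 2: 300}, rng)
print(Counter(labels[i] for i in sample))  # 300 indices from each class
```

Under this reading, each tree already sees the two classes in equal proportion without any class weights, which is one reason to test sampsize and classwt separately.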

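To wrap up:

What you have done seems indeed correct and logical

You should try using the classwt and sampsize parameters in isolation (and not together), in order to be sure where your improved accuracy should be attributed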
