How to add weighted loss to Scikit-learn classifiers?

Question

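In many ML applications a weighted loss may be desirable, since some types of incorrect predictions lead to worse outcomes than others. For example, in medical binary classification (healthy/ill), a false negative, where the patient does not get further examinations, is a worse outcome than a false positive, where a follow-up examination will reveal the error.

So if I define a weighted loss function like this: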

def weighted_loss(prediction, target):
    if prediction == target:
        return 0  # correct, no loss
    elif prediction == 0:  # class 0 is healthy
        return 100  # false negative, very bad
    else:
        return 1  # false positive, incorrect

How can I pass something equivalent to this to scikit-learn classifiers such as Random Forests or SVM classifiers?

Answer 1

I am afraid your question is ill-posed, stemming from a fundamental confusion between the distinct notions of loss and metric.

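Loss functions do not work with conditions of the prediction == target type. That is what metrics (accuracy, precision, recall, etc.) do; metrics, however, play no role during loss optimization (i.e. training) and serve only for performance assessment. Loss does not work with hard class predictions; it works only with the probabilistic (more generally, continuous) outputs of the classifier, where such equality conditions never apply.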
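To make the distinction concrete, here is a minimal sketch using made-up labels and probabilities (the arrays are illustrative assumptions, not data from the question). Both sets of probabilities give identical hard predictions, so a metric such as accuracy cannot tell them apart, while a loss such as log loss can:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Toy ground truth and two hypothetical sets of predicted P(class 1).
y_true = np.array([0, 0, 1, 1])
probas = {
    "confident": np.array([0.10, 0.20, 0.80, 0.90]),
    "hesitant": np.array([0.40, 0.45, 0.55, 0.60]),
}

results = {}
for name, proba in probas.items():
    hard = (proba >= 0.5).astype(int)  # hard class predictions
    results[name] = {
        "accuracy": accuracy_score(y_true, hard),  # metric: uses hard labels
        "log_loss": log_loss(y_true, proba),       # loss: uses probabilities
    }
    print(name, results[name])
```

The accuracy is the same in both cases because it only sees the thresholded labels; the log loss differs because it is computed directly on the probabilities.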

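An additional "insulation" layer between loss and metrics is the choice of a threshold, which is necessary for converting the probabilistic outputs of a classifier (the only thing that matters during training) into "hard" class predictions (the only thing that matters for the business problem under consideration). Again, this threshold plays absolutely no role during model training, where the only relevant quantity is the loss, which knows nothing about thresholds and hard class predictions. This is nicely put in the Cross Validated thread Reduce Classification Probability Threshold: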

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
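As a rough illustration of that separation, the sketch below fits a model once and then applies two different thresholds to the same probabilities. The synthetic dataset, the choice of LogisticRegression, and the 0.2 threshold are all assumptions made for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data; class 1 plays the role of "ill".
X, y = make_classification(n_samples=500, random_state=0)

# Statistical component: fit once and output probabilities.
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba_ill = clf.predict_proba(X)[:, 1]

# Decision component: the threshold is applied after training.
# For a binary model like this one, predict() amounts to a 0.5 cut-off.
default_pred = (proba_ill > 0.5).astype(int)
# A hypothetical, more cautious cut-off that flags more patients as ill.
cautious_pred = (proba_ill >= 0.2).astype(int)

print("flagged ill at 0.5:", default_pred.sum())
print("flagged ill at 0.2:", cautious_pred.sum())
```

The fitted model is identical in both cases; only the decision rule applied to its output changes.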

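You can certainly try to optimize this (decision) threshold with an extra procedure outside the narrowly defined model training (i.e. loss minimization), as you briefly describe in the comments. However, your expectation that

I am pretty sure that I'd get better results if the decision boundaries drawn by the RBFs took that into account, when fitting to the data

could be met by using something similar to your weighted_loss function is futile.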

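So, no function similar to the weighted_loss shown here (which is essentially a metric and not a loss function, despite its name), employing equality conditions such as prediction == target, can be used for model training.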
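What such a function can be used for is the threshold selection mentioned above, i.e. as a metric evaluated after training. The following is a minimal sketch under stated assumptions: the synthetic data, the RandomForestClassifier settings, the threshold grid, and the helper name total_cost are all hypothetical; only the 100/1 costs come from the question's weighted_loss.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def total_cost(y_true, y_pred, fn_cost=100, fp_cost=1):
    """Sum of the question's weighted_loss over all samples (a metric)."""
    false_neg = np.sum((y_true == 1) & (y_pred == 0))  # ill predicted healthy
    false_pos = np.sum((y_true == 0) & (y_pred == 1))  # healthy predicted ill
    return fn_cost * false_neg + fp_cost * false_pos


# Synthetic, imbalanced data; class 0 is "healthy", class 1 is "ill".
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

# Training minimizes the model's own criterion; it never sees total_cost.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
proba_ill = clf.predict_proba(X_val)[:, 1]

# Decision step: pick the threshold with the lowest cost on held-out data.
thresholds = np.linspace(0.05, 0.95, 19)
costs = [total_cost(y_val, (proba_ill >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]
print("lowest-cost threshold on validation data:", best_threshold)
```

The threshold is chosen on a validation split rather than on the training data, so the selected value is not tuned to samples the model has already seen.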

The discussion in the following SO threads might also help clarify the issue:

Loss & accuracy - Are these reasonable learning curves?
What is the difference between loss function and metric in Keras? (despite the title, the definitions are generally applicable, not only to Keras)
Cost function training target versus accuracy desired goal
How to interpret loss and accuracy for a machine learning model

