Poor performance of SVM on an unbalanced dataset: how to improve?

Question

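Consider a dataset A that contains examples for training a binary classification problem. Because the dataset is highly imbalanced, I used an SVM with a weighting method (in MATLAB). The weights I applied are inversely proportional to the frequency of the data in each class. This is done during training using the command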

 fitcsvm(trainA, trainTarg, ...
     'KernelFunction', 'RBF', 'KernelScale', 'auto', ...
     'BoxConstraint', C, 'Weights', weightTrain);
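For reference, inverse-frequency weights of the kind described above could be built roughly as follows. This is a minimal sketch, not the exact code used in the question: the toy label vector is a hypothetical stand-in for `trainTarg`, and the helper variables `classes`, `idx`, and `counts` are illustrative names.

```matlab
% Hypothetical toy labels standing in for trainTarg (0 = majority, 1 = minority).
trainTarg = [zeros(8,1); ones(2,1)];

% Map every label to its class index and count samples per class.
[classes, ~, idx] = unique(trainTarg);
counts = accumarray(idx, 1);

% Per-sample weight = 1 / (number of samples in that sample's class),
% so each class contributes the same total weight.
weightTrain = 1 ./ counts(idx);

disp(table(classes, counts))
```

One point worth verifying: as far as I can tell, the fitcsvm documentation says that Weights are normalized to sum to the prior probability of the respective class. If that is the case, weights that are constant within a class may be rescaled back by the default empirical prior, so it could be worth also trying 'Prior', 'uniform' or a cost matrix.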

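I used 10-fold cross-validation for training and for learning the hyperparameters. Inside the CV loop, dataset A is therefore split into a training set (trainA) and a validation set (valA). After training is over and outside the CV loop, I get this confusion matrix on A: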

80025 1
0 140

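where the first row is the majority class and the second row is the minority class. There is only 1 false positive (FP), and all minority class examples have been correctly classified, giving true positives (TP) = 140.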

PROBLEM: I then ran the trained model on a new test dataset B that was never seen during training. This is the confusion matrix for testing on B.

50075 0
100 0

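As can be seen, the minority class has not been classified at all, so the purpose of the weights has failed. Although there are no FPs, the SVM fails to capture any of the minority class examples. I have not applied any weights or balancing methods such as sampling (SMOTE, RUSBoost, etc.) on B. What could be wrong, and how can I overcome this problem?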
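To make the gap between the two results concrete, the minority-class recall and the overall accuracy can be computed directly from the two confusion matrices above. This is only arithmetic on the reported numbers, assuming rows are true classes and columns are predicted classes; the helper names `recall` and `accuracy` are illustrative.

```matlab
% Confusion matrices as reported (rows: true class, majority class first).
cmA = [80025 1; 0 140];     % dataset A
cmB = [50075 0; 100 0];     % unseen dataset B

% Minority-class recall: TP / (TP + FN), taken from the second row.
recall   = @(cm) cm(2,2) / sum(cm(2,:));
% Overall accuracy: correctly classified samples / all samples.
accuracy = @(cm) trace(cm) / sum(cm(:));

fprintf('A: recall = %.3f, accuracy = %.5f\n', recall(cmA), accuracy(cmA));
fprintf('B: recall = %.3f, accuracy = %.5f\n', recall(cmB), accuracy(cmB));
```

On B the accuracy stays close to 0.998 while the minority recall is 0, so accuracy alone hides the failure.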

Answer 1

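Class mis-classification weights can be set instead of sample weights!

You can set the class weights based on the following example.

The mis-classification weight for class A (n records; dominant) as class B (m records; minority class) can be n/m. The mis-classification weight for class B as class A can be set to 1 or m/n, depending on how much severity you want to impose on the learning.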

% Cost(i,j) is the cost of classifying a point into class j when its true class is i.
c = [0 2.2; 1 0];
mdl = fitcsvm(X, Y, 'Cost', c);

As per the documentation:

For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. Consequently, the cost matrix resets to the default. For more details on the relationships and algorithmic behavior of BoxConstraint, Cost, Prior, Standardize, and Weights, see Algorithms.
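A fuller version of the cost-matrix idea might look like the sketch below. It runs on synthetic data, and the sizes `n` and `m`, the name `costMat`, and the choice of n/m as the penalty are assumptions made for illustration. In MATLAB's convention, Cost(i,j) is the cost of predicting class j when the true class is i, with rows and columns ordered as in ClassNames, so which entry carries the larger value depends on that order. Here the minority class is listed second and the larger cost is placed on missing a minority sample, which is the usual choice when the goal is to recover the minority class.

```matlab
rng(0);                                  % reproducible toy data
n = 500;  m = 20;                        % hypothetical majority / minority sizes
X = [randn(n,2); randn(m,2) + 2];        % two overlapping Gaussian clouds
Y = [zeros(n,1); ones(m,1)];             % 0 = majority, 1 = minority

% Row 2, column 1: true minority predicted as majority costs n/m.
% Row 1, column 2: true majority predicted as minority costs 1.
costMat = [0 1; n/m 0];

mdl = fitcsvm(X, Y, ...
    'KernelFunction', 'rbf', 'KernelScale', 'auto', ...
    'ClassNames', [0 1], 'Cost', costMat);

% Confusion matrix on the training data (rows: true class, columns: predicted).
pred = predict(mdl, X);
cm = confusionmat(Y, pred, 'Order', [0 1]);
disp(cm)
```

Evaluating on held-out data rather than the training set, as in the question's dataset B, is what actually shows whether the cost setting helps.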

Answer 2

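Area Under Curve (AUC) is usually used to measure a model's performance on imbalanced data. It is also good to plot the ROC curve to get more insight visually. Using only the confusion matrix for such models may lead to misinterpretation.

perfcurve from the Statistics and Machine Learning Toolbox provides both functionalities.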
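A minimal sketch of that workflow, using randomly generated scores in place of real classifier output (the label and score vectors here are hypothetical):

```matlab
rng(1);                                        % reproducible toy data
labels = [zeros(950,1); ones(50,1)];           % 1 = minority (positive) class
scores = [randn(950,1); randn(50,1) + 1.5];    % hypothetical classifier scores

% perfcurve returns the ROC coordinates, the thresholds, and the AUC.
[fpr, tpr, ~, auc] = perfcurve(labels, scores, 1);

plot(fpr, tpr);
xlabel('False positive rate');
ylabel('True positive rate');
title(sprintf('ROC curve (AUC = %.3f)', auc));
```

For a model trained with fitcsvm, the scores can be taken from the second output of predict, for example `[~, score] = predict(mdl, Xtest)`, using the column that corresponds to the positive class.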

matlab

classification

machine-learning

svm