Optimal class weight parameter for the following SVC?

Question

Hello, I'm training a classifier with sklearn, and I have the following label distribution:

label : 0 frequency :  119
label : 1 frequency :  1615
label : 2 frequency :  197
label : 3 frequency :  70
label : 4 frequency :  203
label : 5 frequency :  137
label : 6 frequency :  18
label : 7 frequency :  142
label : 8 frequency :  15
label : 9 frequency :  182
label : 10 frequency :  986
label : 12 frequency :  73
label : 13 frequency :  27
label : 14 frequency :  81
label : 15 frequency :  168
label : 18 frequency :  107
label : 21 frequency :  125
label : 22 frequency :  172
label : 23 frequency :  3870
label : 25 frequency :  2321
label : 26 frequency :  25
label : 27 frequency :  314
label : 28 frequency :  76
label : 29 frequency :  116

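One thing that clearly stands out is that I'm working with an imbalanced dataset: most of the samples belong to classes 25, 23, 1, and 10. After training, I got the following poor results: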

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        31
          1       0.61      0.23      0.34       528
          2       0.00      0.00      0.00        70
          3       0.67      0.06      0.11        32
          4       0.00      0.00      0.00        62
          5       0.78      0.82      0.80        39
          6       0.00      0.00      0.00         3
          7       0.00      0.00      0.00        46
          8       0.00      0.00      0.00         5
          9       0.00      0.00      0.00        62
         10       0.14      0.01      0.02       313
         12       0.00      0.00      0.00        30
         13       0.31      0.57      0.40         7
         14       0.00      0.00      0.00        35
         15       0.00      0.00      0.00        56
         18       0.00      0.00      0.00        35
         21       0.00      0.00      0.00        39
         22       0.00      0.00      0.00        66
         23       0.41      0.74      0.53      1278
         25       0.28      0.39      0.33       758
         26       0.50      0.25      0.33         8
         27       0.29      0.02      0.03       115
         28       1.00      0.61      0.76        23
         29       0.00      0.00      0.00        42

avg / total       0.33      0.39      0.32      3683

I get a lot of zeros: the SVC fails to learn several of the classes. The hyperparameters I'm using are:

from sklearn import svm
clf2= svm.SVC(kernel='linear')

To overcome this problem, I built a dictionary with a weight for each class, as follows:

# Weight each class by its relative frequency (its share of all labels).
weight = {}
for label in uniqLabels:
    weight[label] = labels_cluster.count(label) / len(labels_cluster)

for label, w in weight.items():
    print(label, w)
print(weight)

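These are the resulting values. I simply divide the number of elements with a given label by the total number of elements in the label set, so the values sum to 1: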

0 0.010664037996236221
1 0.14472622994892015
2 0.01765391164082803
3 0.006272963527197778
4 0.018191594228873554
5 0.012277085760372793
6 0.0016130477641365713
7 0.012725154583744062
8 0.0013442064701138096
9 0.01630970517071422
10 0.0883591719688144
12 0.0065418048212205395
13 0.002419571646204857
14 0.007258714938614571
15 0.015055112465274667
18 0.009588672820145173
21 0.011201720584281746
22 0.015413567523971682
23 0.34680526928936284
25 0.20799354780894344
26 0.0022403441168563493
27 0.028138722107715744
28 0.006810646115243301
29 0.01039519670221346

I then tried again with this weight dictionary:

from sklearn import svm
clf2= svm.SVC(kernel='linear',class_weight=weight)

And I got:

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        31
          1       0.90      0.19      0.31       528
          2       0.00      0.00      0.00        70
          3       0.00      0.00      0.00        32
          4       0.00      0.00      0.00        62
          5       0.00      0.00      0.00        39
          6       0.00      0.00      0.00         3
          7       0.00      0.00      0.00        46
          8       0.00      0.00      0.00         5
          9       0.00      0.00      0.00        62
         10       0.00      0.00      0.00       313
         12       0.00      0.00      0.00        30
         13       0.00      0.00      0.00         7
         14       0.00      0.00      0.00        35
         15       0.00      0.00      0.00        56
         18       0.00      0.00      0.00        35
         21       0.00      0.00      0.00        39
         22       0.00      0.00      0.00        66
         23       0.36      0.99      0.52      1278
         25       0.46      0.01      0.02       758
         26       0.00      0.00      0.00         8
         27       0.00      0.00      0.00       115
         28       0.00      0.00      0.00        23
         29       0.00      0.00      0.00        42

avg / total       0.35      0.37      0.23      3683

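Since I'm not getting good results, I'd really appreciate suggestions on how to automatically tune the weight of each class and pass it to the SVC. I don't have much experience with imbalanced problems, so all suggestions are welcome.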

Answer 1

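It looks like you are doing the opposite of what you should be doing. In particular, you want to give higher weights to the smaller classes, so that the classifier is penalized more heavily for mistakes on those classes during training. A good starting point is to set class_weight="balanced".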
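
As a minimal sketch of that suggestion, the snippet below uses hypothetical toy data (not the dataset from the question) to show the weights that scikit-learn's "balanced" heuristic produces, namely n_samples / (n_classes * count of each class), and then passes class_weight="balanced" to SVC. The variable names and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + 1.5 * y[:, None]

# "balanced" weights: n_samples / (n_classes * count_of_class),
# so the rare class gets the larger weight.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# Passing the string lets SVC derive the same weights from y during fit.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X, y)
```

Note that these weights are inversely proportional to class frequency, whereas the dictionary in the question is directly proportional to it. An explicit dictionary such as class_weight above can also be passed to SVC if the weights need manual adjustment.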

svc

scikit-learn