Explanation for coverage_error metric in scikit-learn

Question

I don't understand how coverage_error, available in the sklearn.metrics module of scikit-learn, is computed. The documentation explains it as follows:

The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted.

For example:

import numpy as np
from sklearn.metrics import coverage_error
y_true = np.array([[1, 0, 0], [0, 1, 1]])
y_score = np.array([[1, 0, 0], [0, 1, 1]])
print(coverage_error(y_true, y_score))
1.5

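As I understand it, we need to include 3 labels from the predictions here to get all the labels in y_true, so the coverage error is 3/2, i.e. 1.5. But I can't understand what happens in the following cases: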

>>> y_score = np.array([[1, 0, 0], [0, 0, 1]])
>>> print(coverage_error(y_true, y_score))
2.0
>>> y_score = np.array([[1, 0, 1], [0, 1, 1]])
>>> print(coverage_error(y_true, y_score))
2.0

Why is the error the same in both cases?

Answer 1

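You can take a look at the User Guide, section 3.3.3, Multilabel ranking metrics, which defines coverage error as

$$coverage(y, \hat{f}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} \max_{j : y_{ij} = 1} \text{rank}_{ij}$$

with

$$\text{rank}_{ij} = \left| \{ k : \hat{f}_{ik} \ge \hat{f}_{ij} \} \right|$$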

One thing you need to pay attention to is how the ranks are computed and how ties in y_score are broken.

To be specific, the first case:

In [4]: y_true
Out[4]:
array([[1, 0, 0],
       [0, 1, 1]])

In [5]: y_score
Out[5]:
array([[1, 0, 0],
       [0, 0, 1]])

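For the first sample, the first label is the true one, and the rank of its score is 1.
For the second sample, the second and third labels are true, and the ranks of their scores are 3 and 1 respectively, so the max rank is 3.
The average is (3+1)/2 = 2.

The second case: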

In [7]: y_score
Out[7]:
array([[1, 0, 1],
       [0, 1, 1]])

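For the first sample, the first label is the true one, and the rank of its score is 2, because it ties with the third label.
For the second sample, the second and third labels are true, and the ranks of their scores are 2 and 2, so the max rank is 2.
The average is (2+2)/2 = 2.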
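
The arithmetic in the two cases above can be reproduced by hand. The following is a minimal pure-Python sketch, not scikit-learn's implementation; `label_ranks` and `manual_coverage_error` are hypothetical helper names introduced only for illustration.

```python
def label_ranks(scores):
    """Rank of each label: how many labels score >= it (itself included)."""
    return [sum(1 for other in scores if other >= s) for s in scores]


def manual_coverage_error(y_true, y_score):
    """Return (average max rank, per-sample max ranks) over the true labels."""
    max_ranks = []
    for truth, scores in zip(y_true, y_score):
        ranks = label_ranks(scores)
        # Worst (largest) rank among the labels that are actually true
        max_ranks.append(max(r for r, t in zip(ranks, truth) if t == 1))
    return sum(max_ranks) / len(max_ranks), max_ranks


y_true = [[1, 0, 0], [0, 1, 1]]

# First case: the per-sample max ranks are 1 and 3
print(manual_coverage_error(y_true, [[1, 0, 0], [0, 0, 1]]))  # (2.0, [1, 3])
# Second case: the per-sample max ranks are 2 and 2
print(manual_coverage_error(y_true, [[1, 0, 1], [0, 1, 1]]))  # (2.0, [2, 2])
```

Both calls give 2.0 even though the per-sample max ranks differ, which is why the two cases report the same error.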

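Edit:

The ranks are computed within a single sample of y_score. The formula says that the rank of a label is the number of labels (including itself) whose score is greater than or equal to its own score.

It is like sorting the labels by y_score: the label with the highest score is ranked 1, the second highest is ranked 2, the third highest is ranked 3, and so on. But if the second and third highest labels have the same score, both are ranked 3.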
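
The tie rule can be seen by sorting one sample's scores. This is only a sketch of the sorting analogy, assuming a hypothetical helper `ranks_with_ties`; the score values are made up for illustration.

```python
def ranks_with_ties(scores):
    # Sort scores from highest to lowest; a label's rank is the LAST
    # 1-based position at which its score appears in that ordering,
    # so tied labels all receive the larger rank.
    ordered = sorted(scores, reverse=True)
    return [len(ordered) - ordered[::-1].index(s) for s in scores]


print(ranks_with_ties([0.9, 0.5, 0.2]))  # [1, 2, 3]  no ties
print(ranks_with_ties([0.9, 0.5, 0.5]))  # [1, 3, 3]  second and third tie
```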

Note that y_score is

Target scores, can either be probability estimates of the positive class, confidence values, or binary decisions.

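Our goal is to get all the true labels predicted, so we need to include every label whose score is greater than or equal to that of the lowest-scoring true label.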
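
Seen this way, the per-sample coverage is the size of the set of labels that must be predicted. A short sketch under that reading follows; `labels_needed` is a hypothetical name, and the inputs are the second sample from the two cases in the question.

```python
def labels_needed(truth, scores):
    # The lowest score among the true labels acts as the cut-off.
    threshold = min(s for s, t in zip(scores, truth) if t == 1)
    # Every label scoring at or above the cut-off has to be predicted.
    return [j for j, s in enumerate(scores) if s >= threshold]


truth = [0, 1, 1]
print(labels_needed(truth, [0, 0, 1]))  # [0, 1, 2]  3 labels needed
print(labels_needed(truth, [0, 1, 1]))  # [1, 2]     2 labels needed
```

The length of each returned list equals the max rank computed for that sample.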

scikit-learn

multilabel-classification