Different result with roc_auc_score() and auc()

Question

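I have trouble understanding the difference (if there is one) between roc_auc_score() and auc() in scikit-learn.

I am trying to predict a binary output with imbalanced classes (around 1.5% for Y=1).

Classifier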

model_logit = LogisticRegression(class_weight='auto')
model_logit.fit(X_train_ridge, Y_train)

ROC curve

false_positive_rate, true_positive_rate, thresholds = roc_curve(Y_test, clf.predict_proba(xtest)[:,1])

AUC

auc(false_positive_rate, true_positive_rate)
Out[490]: 0.82338034042531527

and

roc_auc_score(Y_test, clf.predict(xtest))
Out[493]: 0.75944737191205602

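Can somebody explain this difference? I thought both were just calculating the area under the ROC curve. It might be because of the imbalanced dataset, but I could not figure out why.

Thanks!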

Answer 1

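AUC is not always the area under a ROC curve. Area Under the Curve is an (abstract) area under some curve, so it is a more general notion than AUROC. With imbalanced classes, it may be better to compute the AUC of a precision-recall curve.

See the sklearn source for roc_auc_score (as it was at the time of writing):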
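To illustrate the point about precision-recall curves, here is a minimal sketch that computes both areas from the same scores. The synthetic dataset, the 5% positive rate, and the variable names are assumptions made for the example, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); purely illustrative.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities

# auc() is a generic trapezoidal integrator: it accepts any monotonic x,
# so it works on the (recall, precision) pairs as well as on (fpr, tpr).
precision, recall, _ = precision_recall_curve(y_test, scores)
pr_auc = auc(recall, precision)
roc_auc = roc_auc_score(y_test, scores)

print("PR AUC: ", pr_auc)
print("ROC AUC:", roc_auc)
```

On heavily imbalanced data the two numbers can tell quite different stories, which is why the choice of curve matters.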

def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
    # <...> docstring <...>
    def _binary_roc_auc_score(y_true, y_score, sample_weight=None):
            # <...> bla-bla <...>

            fpr, tpr, tresholds = roc_curve(y_true, y_score,
                                            sample_weight=sample_weight)
            return auc(fpr, tpr, reorder=True)

    return _average_binary_score(
        _binary_roc_auc_score, y_true, y_score, average,
        sample_weight=sample_weight)

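As you can see, this first computes the ROC curve and then calls auc() to get the area.

I guess your problem is the predict_proba() call. With a plain predict() on both sides, the outputs are always the same: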

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, roc_auc_score

# class_weight='auto' from older releases is now spelled 'balanced'
est = LogisticRegression(class_weight='balanced')
X = np.random.rand(10, 2)
y = np.random.randint(2, size=10)
est.fit(X, y)

# Both metrics receive the same hard class labels
false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict(X))
print(auc(false_positive_rate, true_positive_rate))
# e.g. 0.857142857143 (varies with the random data)
print(roc_auc_score(y, est.predict(X)))
# e.g. 0.857142857143 (same value as above)

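If you change the above to the following, you will sometimes get different outputs:

# The curve is built from probabilities, but roc_auc_score still gets labels
false_positive_rate, true_positive_rate, thresholds = roc_curve(y, est.predict_proba(X)[:,1])
# may differ
print(auc(false_positive_rate, true_positive_rate))
print(roc_auc_score(y, est.predict(X)))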
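A reproducible version of the comparison may make this easier to check. The sketch below uses a seeded synthetic dataset (an assumption for the example) and evaluates all four combinations, so the only thing that varies is whether labels or probabilities are passed in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Seeded, imbalanced toy data so the run is repeatable.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
est = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

proba = est.predict_proba(X)[:, 1]  # continuous scores
labels = est.predict(X)             # hard 0/1 decisions

fpr_p, tpr_p, _ = roc_curve(y, proba)
fpr_l, tpr_l, _ = roc_curve(y, labels)

auc_proba = auc(fpr_p, tpr_p)
auc_labels = auc(fpr_l, tpr_l)
score_proba = roc_auc_score(y, proba)
score_labels = roc_auc_score(y, labels)

# Same input -> same number, whichever function is used.
print("probabilities:", auc_proba, score_proba)
print("labels:       ", auc_labels, score_labels)
```

The two functions agree whenever they are given the same scores; the discrepancy in the question comes from feeding probabilities to one and labels to the other.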

Answer 2

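predict returns only one class or the other. If you compute a ROC curve from the results of predict on a classifier, there are only three thresholds (the trivial "everything is one class", the trivial "everything is the other class", and one in between). Your ROC curve looks like this: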

      ..............................
      |
      |
      |
......|
|
|
|
|
|
|
|
|
|
|
|

Meanwhile, predict_proba() returns an entire range of probabilities, so now you can put more than three thresholds on your data.

             .......................
             |
             |
             |
          ...|
          |
          |
     .....|
     |
     |
 ....|
.|
|
|
|
|

Hence the different areas.
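The difference in the number of thresholds can be checked directly. This is a small sketch on synthetic data (the dataset and names are assumptions for the example); it only counts the thresholds that roc_curve reports in each case.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Seeded toy data; any binary problem would do.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Hard labels: only the values 0 and 1 can act as cut-offs
# (plus the extra starting point that roc_curve prepends).
_, _, thr_labels = roc_curve(y, clf.predict(X))

# Probabilities: many distinct values, hence many more cut-offs.
_, _, thr_proba = roc_curve(y, clf.predict_proba(X)[:, 1])

print("thresholds from predict():      ", len(thr_labels))
print("thresholds from predict_proba():", len(thr_proba))
```

Note that roc_curve drops some collinear points by default (drop_intermediate=True), so the second count is usually smaller than the number of samples.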

Answer 3

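When you use y_pred (class labels), you have already decided on the threshold. When you use y_prob (the positive-class probability), you are still open about the threshold, and the ROC curve should help you decide on it.

For the first case you are using the probabilities: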

y_probs = clf.predict_proba(xtest)[:,1]
fp_rate, tp_rate, thresholds = roc_curve(y_true, y_probs)
auc(fp_rate, tp_rate)

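When you do that, you are considering the AUC 'before' taking a decision on the threshold you will be using.

In the second case you are using the predictions (not the probabilities). In that case, use 'predict' instead of 'predict_proba' for both calls and you should get the same result.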

y_pred = clf.predict(xtest)
fp_rate, tp_rate, thresholds = roc_curve(y_true, y_pred)
print(auc(fp_rate, tp_rate))
# 0.857142857143

print(roc_auc_score(y_true, y_pred))
# 0.857142857143
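As a sketch of how the ROC curve can help pick a threshold, the example below uses Youden's J statistic (the point maximising TPR minus FPR). That criterion is an assumption chosen for illustration, not something prescribed in this thread, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

X, y_true = make_classification(n_samples=2000, n_features=10,
                                weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y_true)
y_probs = clf.predict_proba(X)[:, 1]

fp_rate, tp_rate, thresholds = roc_curve(y_true, y_probs)

# Youden's J = TPR - FPR. Index 0 is skipped because thresholds[0] is a
# sentinel added by roc_curve, not an actual predicted probability.
best = np.argmax((tp_rate - fp_rate)[1:]) + 1
chosen_threshold = thresholds[best]

# Committing to a threshold turns the scores into hard labels.
y_pred = (y_probs >= chosen_threshold).astype(int)

auc_before = roc_auc_score(y_true, y_probs)  # threshold still open
auc_after = roc_auc_score(y_true, y_pred)    # threshold already fixed

print("chosen threshold:", chosen_threshold)
print("AUC from probabilities:", auc_before)
print("AUC from labels:       ", auc_after)
```

Once the threshold is fixed, the label-based value reduces to (1 + TPR - FPR) / 2 at that single operating point, which is why it no longer summarises the whole curve.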
