Plotting ROC Curve with Multiple Classes

I am following the documentation for plotting ROC curves for multiple classes at this link: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

I am particularly confused by this line:

y_score = classifier.fit(X_train, y_train).decision_function(X_test)

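In other examples I have seen, y_score holds probabilities, and they are all positive, as we would expect. In this example, however, the values in y_score (one column for each of classes A-C) are mostly negative. Interestingly, each row still adds up to roughly -1: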

In: y_score[0:5,:]
Out: array([[-0.76305896, -0.36472635,  0.1239796 ],
            [-0.20238399, -0.63148982, -0.16616656],
            [ 0.11808492, -0.80262259, -0.32062486],
            [-0.90750303, -0.1239792 ,  0.02184016],
            [-0.01108555, -0.27918155, -0.71882525]])

How should I interpret this? How can I tell, from y_score alone, which class the model predicts for each input?

Edit: all relevant code:

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle

from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from numpy import interp  # scipy.interp is a deprecated alias of numpy.interp and is missing from recent SciPy releases

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', 
                                 probability=True,
                                 random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

decision_function returns, for each class, the signed distance of the sample from that class's decision boundary. It is not a probability, which is why the values can be negative: a negative value puts the sample on the "rest" side of that class's one-vs-rest boundary. If you want probabilities, use the predict_proba method. If you want the class the estimator assigns to the sample, use predict.

import numpy as np
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', 
                                 probability=True,
                                 random_state=random_state))

# train the classifier
classifier.fit(X_train, y_train)

# generate y_score
y_score = classifier.decision_function(X_test)

# generate probabilities
y_prob = classifier.predict_proba(X_test)

# generate predictions
y_pred = classifier.predict(X_test)

Results:

>>> y_score[0:5,:]
array([[-0.76305896, -0.36472635,  0.1239796 ],
       [-0.20238399, -0.63148982, -0.16616656],
       [ 0.11808492, -0.80262259, -0.32062486],
       [-0.90750303, -0.1239792 ,  0.02184016],
       [-0.01108555, -0.27918155, -0.71882525]])
>>> y_prob[0:5,:]
array([[0.06019732, 0.24174159, 0.8293423 ],
       [0.35610687, 0.30121076, 0.46392587],
       [0.65735935, 0.34605074, 0.25675446],
       [0.03458982, 0.19539083, 0.72575167],
       [0.53656981, 0.22445759, 0.03221816]])
>>> y_pred[0:5,:]
array([[0, 0, 1],
       [0, 0, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 0, 0]])
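
To answer the "from y_score alone" part of the question, here is a minimal sketch that uses only the five y_score rows printed above. It assumes you want exactly one label per sample, in which case taking the column with the largest score is a common choice. The names top_class and indicator are made up for this illustration.

```python
import numpy as np

# First five rows of y_score, copied from the output above
y_score = np.array([
    [-0.76305896, -0.36472635,  0.1239796 ],
    [-0.20238399, -0.63148982, -0.16616656],
    [ 0.11808492, -0.80262259, -0.32062486],
    [-0.90750303, -0.1239792 ,  0.02184016],
    [-0.01108555, -0.27918155, -0.71882525],
])

# One label per sample: the column with the largest score
top_class = np.argmax(y_score, axis=1)
print(top_class)  # [2 2 0 2 0]

# Thresholding each column at 0 gives a multilabel indicator instead;
# a row of all zeros means no one-vs-rest classifier claimed the sample
indicator = (y_score > 0).astype(int)
print(indicator)
```

For these five rows the thresholded indicator matches the y_pred output shown above, including the two all-zero rows. That is why predict can return a row with no class at all when y has been binarized, while argmax always picks one.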

To actually plot a multi-class ROC, use the label_binarize function.
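
As a quick sketch of what label_binarize does, the snippet below converts a handful of toy labels (invented for illustration, not taken from the Iris data) into one 0/1 column per class. Each column can then be passed to roc_curve as a separate binary problem.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Toy labels for illustration only (not taken from the Iris data)
labels = np.array([0, 2, 1, 2])

# Each class becomes its own 0/1 column: column i is "class i vs. the rest"
binarized = label_binarize(labels, classes=[0, 1, 2])
print(binarized)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 0 1]]
```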

Example using the Iris data:

from itertools import cycle

import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                 random_state=0))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
lw = 2  # line width shared by every curve and the chance diagonal
colors = cycle(['blue', 'red', 'green'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()