PR-ROC curve with cross validation strange behaviour

Inspired by this ROC curve using cross validation, I tried to create a PR-ROC curve using cross validation. However, the resulting PR-ROC curve looks strange, not like a PR-ROC curve usually looks when I run it without CV. Here it is:

The code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import RepeatedStratifiedKFold

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

ppv_arr = list()
pr_auc_arr = list()
base_tpr = np.linspace(0, 1, 101)

for train_index, test_index in rskf.split(X, y):
    
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    calibrated_clf.fit(X_train, y_train)
    
    y_hat = calibrated_clf.predict_proba(X_test)
    ppv, tpr, _ = precision_recall_curve(y_test, y_hat[:, 1], pos_label='positive')
    
    pr_auc = average_precision_score(y_test, y_hat[:, 1], pos_label='positive')
    pr_auc_arr.append(pr_auc)
    
    plt.plot(ppv, tpr, color='r', alpha=0.15)
    ppv = np.interp(base_tpr, ppv, tpr)
    ppv[0] = 0.0
    ppv_arr.append(ppv)

ppv_arr = np.array(ppv_arr)
mean_ppv = ppv_arr.mean(axis=0)
std = ppv_arr.std(axis=0)

ppv_upper = np.minimum(mean_ppv + std, 1)
ppv_lower = mean_ppv - std

plt.plot(mean_ppv, base_tpr, label=f'AUC: {np.mean(pr_auc_arr):.2f}', color='r')
plt.fill_between(base_tpr, ppv_lower, ppv_upper, color='grey', alpha=0.3)
plt.plot([0, 1], [1, 0], 'b--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.ylabel('Positive Predictive Value')
plt.xlabel('True Positive Rate')
plt.title('KNN PR-ROC Curve and PR-AUC')
plt.legend(loc='best')
plt.show()

I am not sure what the problem is. I went back over the linked code, thinking that since it is based on a ROC curve I might have accidentally kept something ROC-specific, but I did not see anything. Maybe this is just what a PR-ROC curve with CV looks like? Or maybe this particular model (KNN) is simply bad, and that is why the curve looks so strange.
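One possible culprit worth checking (my own guess, not something confirmed in the post): `np.interp` silently assumes its x-coordinates are monotonically increasing, while `precision_recall_curve` returns recall in *decreasing* order. A toy illustration with made-up recall/precision values:

```python
import numpy as np

# precision_recall_curve returns recall in decreasing order, but
# np.interp assumes increasing x-coordinates and produces an
# undefined result (no error is raised) when they are not sorted.
recall = np.array([1.0, 0.5, 0.0])      # decreasing, as sklearn returns it
precision = np.array([0.2, 0.8, 1.0])

bad = np.interp(0.25, recall, precision)               # undefined behaviour

# Reversing both arrays first makes the x-coordinates increasing,
# giving a well-defined interpolation: halfway between 1.0 and 0.8.
good = np.interp(0.25, recall[::-1], precision[::-1])  # 0.9
```

If the strange shape comes from this, sorting the arrays before interpolating (as the second snippet below does with `[::-1]`) would explain why that version behaves.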

After failing to figure out what was causing the problem, I decided to start again from scratch, following the ROC curve example less strictly (since my PR-ROC curve was behaving too much like a ROC curve).

Below is the new, working code for reference:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import RepeatedStratifiedKFold

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

y_real = list()
y_proba = list()
ppv_arr = list()
tpr_arr = np.linspace(0, 1, 100)
    
for train_index, test_index in rskf.split(X, y):
    
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    calibrated_clf.fit(X_train, y_train)
    
    y_hat = calibrated_clf.predict_proba(X_test)
    
    ppv, tpr, _ = precision_recall_curve(y_test, y_hat[:, 1], pos_label='positive')
    ppv, tpr = ppv[::-1], tpr[::-1]
    
    precision_arr = np.interp(tpr_arr, tpr, ppv)
    pr_auc = average_precision_score(y_test, y_hat[:, 1], pos_label='positive')
    ppv_arr.append(precision_arr)
    
    plt.subplot(222)
    plt.plot(tpr, ppv, color='r', alpha=0.15)
    
    y_real.append(y_test)
    y_proba.append(y_hat[:, 1])

y_real = np.concatenate(y_real)
y_proba = np.concatenate(y_proba)

ppv, tpr, _ = precision_recall_curve(y_real, y_proba, pos_label='positive')

average_ppv = average_precision_score(y_real, y_proba, pos_label='positive')
mean_ppv = np.mean(ppv_arr, axis=0)
std_ppv = np.std(ppv_arr, axis=0)

plt.subplot(222)
plt.plot(tpr, ppv, color='r', label=f'AUC: {average_ppv:.4f}')
plt.fill_between(tpr_arr, mean_ppv + std_ppv, mean_ppv - std_ppv, alpha=0.3, linewidth=0, color='grey')
plt.plot([0, 1], [1, 0], 'b--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.ylabel('Positive Predictive Value')
plt.xlabel('True Positive Rate')
plt.title('KNN PR-ROC Curve and PR-AUC')
plt.legend(loc='best')
plt.show()

The PR-ROC curve now looks like this:
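The core of the working version is pooling: concatenate the held-out labels and probabilities from every fold, then compute one PR curve over the pooled arrays. A minimal self-contained sketch of that idea, using synthetic data and `LogisticRegression` as stand-ins for the original `X`, `y`, and `calibrated_clf` (which are not shown in the post):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced binary problem standing in for the real data.
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
clf = LogisticRegression(max_iter=1000)

# Collect held-out labels and probabilities from every fold.
y_real, y_proba = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    y_proba.append(clf.predict_proba(X[test_idx])[:, 1])
    y_real.append(y[test_idx])

y_real = np.concatenate(y_real)
y_proba = np.concatenate(y_proba)

# One PR curve and one average-precision score over the pooled predictions.
precision, recall, _ = precision_recall_curve(y_real, y_proba)
ap = average_precision_score(y_real, y_proba)
```

Because every sample appears in exactly one test fold, the pooled arrays cover the whole dataset, and the resulting curve is a single well-formed PR curve rather than an average of per-fold interpolations.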