Getting probabilities of best model for RandomizedSearchCV

Question

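I am using RandomizedSearchCV with 10-fold cross-validation and 100 iterations to find the best parameters. This works well. But now I would also like to get the probabilities (as from predict_proba) for each predicted test data point from the best-performing model.

How can this be done?

I see two options. First, it might be possible to get these probabilities directly from RandomizedSearchCV. Second, I could take the best parameters from RandomizedSearchCV and then run 10-fold cross-validation again with those best parameters (using the same seed so that I get the same splits).

Edit: Is the following code correct for getting the probabilities of the best-performing model? X is the training data, y holds the labels, and model is my RandomizedSearchCV, which wraps a Pipeline that handles missing values, standardization, and an SVM.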

cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_

for train, test in cv_outer.split(X, y):
    probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
    y_prob[test] = probas_

Answer 1

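If I understand correctly, you want the individual scores for each sample in the test split of the fold with the highest CV score. If that is the case, you have to use one of the CV generators to control the split indices, such as those listed here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators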
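A related option, when a probability is wanted for every sample rather than only for one fold, is scikit-learn's cross_val_predict with method="predict_proba". The sketch below is only an illustration under assumptions: the toy data from make_classification stands in for X and y, and a plain SVC with a fixed C stands in for the tuned best_estimator_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

# Toy data standing in for the asker's X and y (assumption).
X, y = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Stand-in for search.best_estimator_ (assumption);
# probability=True enables predict_proba on SVC.
best_model = SVC(C=1.0, probability=True, random_state=0)

# Fixed seed so the same splits can be reproduced later.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each row is predicted by a model that was fitted without that sample.
y_prob = cross_val_predict(
    best_model, X, y, cv=cv_outer, method="predict_proba"
)
print(y_prob.shape)
```

Note that these are out-of-fold probabilities collected over all ten folds, which is closer to the loop in the question than to picking a single best fold.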

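If you want to compute scores for new test samples using the best-performing model, the predict_proba() function of RandomizedSearchCV is sufficient, provided your underlying model supports it.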
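As a rough sketch of that second case, the following fits a RandomizedSearchCV around a pipeline similar to the one described in the question and calls predict_proba on held-out samples. The toy data, the step names ("impute", "scale", "svm"), the search range for C, and the small n_iter and cv values are all assumptions chosen to keep the example quick.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data; X_new plays the role of unseen test samples (assumption).
X, y = make_classification(n_samples=150, n_features=8, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("svm", SVC(probability=True, random_state=0)),  # needed for predict_proba
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"svm__C": loguniform(1e-2, 1e2)},
    n_iter=5,
    cv=5,
    random_state=0,  # refit=True is the default
)
search.fit(X_train, y_train)

# Delegates to best_estimator_, refitted on all of X_train.
proba_new = search.predict_proba(X_new)
print(proba_new[:3])
```

This relies on refit being left at its default of True; with refit=False the search object has no refitted best estimator to predict with.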

Example:

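import numpy
from sklearn.model_selection import StratifiedKFold, cross_val_score

# svc is your estimator object, created beforehand
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)
max_score_split = numpy.argmax(scores)  # index of the best-scoring fold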

Now that you know the best score occurred at max_score_split, you can reproduce that split yourself and fit your model on it.

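# split() returns a generator, so turn it into a list before indexing;
# reuse skf (and pass y) so the folds match the ones scored above
train_indices, test_indices = list(skf.split(X, y))[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train)  # this is your model object, which should have been created beforehand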

Finally, get your predictions with:

model.predict_proba(X_test)

I have not tested the code myself, but it should work with minor modifications.

Answer 2

You need to look at cv_results_, which gives you the scores for every fold and the mean scores, as well as the mean fit time, etc.
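To make that concrete, here is a small sketch of reading a few cv_results_ entries for the best candidate. The toy data, the search range for C, and the small n_iter and cv values are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Toy data standing in for the real X and y (assumption).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

n_folds = 3
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2)},
    n_iter=4,
    cv=n_folds,
    random_state=0,
)
search.fit(X, y)

results = search.cv_results_  # dict of arrays, one entry per candidate
best = search.best_index_     # position of the best candidate

# Per-fold scores live under split0_test_score, split1_test_score, ...
per_fold = [results[f"split{i}_test_score"][best] for i in range(n_folds)]

print("params:", results["params"][best])
print("per-fold test scores:", per_fold)
print("mean test score:", results["mean_test_score"][best])
print("mean fit time (s):", results["mean_fit_time"][best])
```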

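If you want predict_proba() for each iteration, the way to do it is to loop over the params given in cv_results_, refit the model for each of them, and then predict the probabilities, because as far as I know the individual models are not cached anywhere.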
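A minimal sketch of that loop could look like the following. It assumes toy data, a bare SVC as the base estimator, and a held-out X_test to predict on; the name probas_per_candidate is made up for this example.

```python
from scipy.stats import loguniform
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data standing in for the real X and y (assumption).
X, y = make_classification(n_samples=150, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

base = SVC(probability=True, random_state=0)
search = RandomizedSearchCV(
    base, {"C": loguniform(1e-2, 1e2)}, n_iter=4, cv=3, random_state=0
)
search.fit(X_train, y_train)

# Refit one fresh model per sampled parameter setting and collect
# its predicted probabilities for the held-out samples.
probas_per_candidate = []
for params in search.cv_results_["params"]:
    candidate = clone(base).set_params(**params)
    candidate.fit(X_train, y_train)
    probas_per_candidate.append(candidate.predict_proba(X_test))

print(len(probas_per_candidate), probas_per_candidate[0].shape)
```

Using clone gives every candidate an unfitted copy, so no state leaks from one parameter setting to the next.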

best_params_ gives you the best-fitting parameters, in case you only want to train a model with the best parameters next time.
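For instance, the parameters could be passed straight into a new estimator, roughly as sketched here. The toy data and the single tuned parameter C are assumptions; with a Pipeline the keys would carry the step prefix and set_params(**best_params_) would be the natural equivalent.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Toy data standing in for the real X and y (assumption).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

search = RandomizedSearchCV(
    SVC(probability=True, random_state=0),
    {"C": loguniform(1e-2, 1e2)},
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X, y)

# best_params_ is a plain dict, so it can be unpacked into a new model.
final_model = SVC(probability=True, random_state=0, **search.best_params_)
final_model.fit(X, y)
proba = final_model.predict_proba(X)
print(search.best_params_)
```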

See cv_results_ on the documentation page: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

python

machine-learning

scikit-learn

grid-search