Getting proper cross-validation scores with grid search and pipelines in sklearn

Question

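My setup: I am running a procedure (a pipeline) in which I fit a regression after selecting the relevant variables (and after standardizing the data; I have omitted those steps because they are irrelevant here). I optimize it through a grid search, as shown below: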

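import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# 10 stratified random splits, each holding out 20% of the data for testing
fold = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=777)
regression_estimator = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10, solver='newton-cg')
pipeline_steps = [('feature_selection', SelectKBest(f_regression)), ('regression', regression_estimator)]

pipe = Pipeline(steps=pipeline_steps)

# Candidate numbers of features to keep: 1, 4, ..., 31
feature_selection_k_options = np.arange(1, 33, 3)

param_grid = {'feature_selection__k': feature_selection_k_options}

# X, y: feature matrix and class labels (preprocessing omitted, as noted above)
gs = GridSearchCV(pipe, param_grid=param_grid, scoring='recall', cv=fold)
gs.fit(X, y)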

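Because refit=True by default in GridSearchCV, I get best_estimator_ by default, and I am fine with that. What I am missing is how, given this best_estimator_, I can get the cross-validated scores on only the test data, as those are pre-split in the procedure. There is indeed the .score(X, y) method but, as the documentation indicates (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba), it "Returns the mean accuracy on the given test data and labels", whereas I want what is done through cross_val_score (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). The problem is that cross_val_score reruns everything and keeps only those results, while I want all quantities to come from the same procedure.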

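Essentially, I want to extract the cross-validated scores on the test data from the best estimator, with a measure of my choosing (or the one already chosen in the grid search) and with the cross-validation scheme already embedded in my pipeline (in this case StratifiedShuffleSplit).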

Do you know how to do that?

Answer 1

You can access the cross-validation scores through the cv_results_ attribute, which can conveniently be read into a pandas DataFrame:

import pandas as pd
df_result = pd.DataFrame(gs.cv_results_)
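
To make this concrete, here is a minimal sketch of pulling the per-split test scores of the winning parameter setting out of cv_results_. The data is a synthetic stand-in generated with make_classification (an assumption, since the question's X and y are not shown), and the LogisticRegression settings are simplified relative to the question. The splitN_test_score keys and best_index_ are the names GridSearchCV uses for single-metric scoring.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the question's data (assumption: binary labels, 32 features)
X, y = make_classification(n_samples=300, n_features=32, n_informative=8,
                           random_state=777)

fold = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=777)
pipe = Pipeline(steps=[
    ('feature_selection', SelectKBest(f_regression)),
    ('regression', LogisticRegression(max_iter=10000)),
])
param_grid = {'feature_selection__k': np.arange(1, 33, 3)}

gs = GridSearchCV(pipe, param_grid=param_grid, scoring='recall', cv=fold)
gs.fit(X, y)

df_result = pd.DataFrame(gs.cv_results_)

# One column per split: split0_test_score ... split9_test_score
split_cols = [f'split{i}_test_score' for i in range(fold.get_n_splits())]

# Row gs.best_index_ holds the scores of the parameter setting behind best_estimator_
best_split_scores = df_result.loc[gs.best_index_, split_cols].to_numpy(dtype=float)

print(best_split_scores)           # recall on each of the 10 held-out test splits
print(best_split_scores.mean())    # matches gs.best_score_
```

These scores come from the fits performed during the search itself, so nothing is rerun. Note that best_estimator_ is refit on the full data afterwards, so the per-split scores describe the winning parameter setting rather than that final refit model.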

Regarding "with a measure of my choosing", you can take a look at this example, which shows how to compute several scorers at once in GridSearchCV.
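
As a rough illustration of multi-metric scoring, the sketch below passes a dict of scorers to GridSearchCV and reads each metric's per-split test scores for the best setting. The data is again synthetic (an assumption), the choice of accuracy and roc_auc alongside recall is only illustrative, and refit='recall' tells the search which metric decides best_index_ and best_estimator_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: binary labels, 32 features)
X, y = make_classification(n_samples=300, n_features=32, n_informative=8,
                           random_state=777)

fold = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=777)
pipe = Pipeline(steps=[
    ('feature_selection', SelectKBest(f_regression)),
    ('regression', LogisticRegression(max_iter=10000)),
])
param_grid = {'feature_selection__k': np.arange(1, 33, 3)}

# Several metrics evaluated on the same splits in a single search
scoring = {'recall': 'recall', 'accuracy': 'accuracy', 'roc_auc': 'roc_auc'}

# With multiple metrics, refit must name the one used to pick the best setting
gs = GridSearchCV(pipe, param_grid=param_grid, scoring=scoring,
                  refit='recall', cv=fold)
gs.fit(X, y)

results = gs.cv_results_
n_splits = fold.get_n_splits()

# Keys follow the pattern split<i>_test_<metric name>
best_scores = {
    name: np.array([results[f'split{i}_test_{name}'][gs.best_index_]
                    for i in range(n_splits)])
    for name in scoring
}

for name, per_split in best_scores.items():
    print(name, per_split.mean(), per_split.std())
```

All metrics are computed on identical train/test splits, so they can be compared split by split without calling cross_val_score again.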
