Do we evaluate accuracy on cross_val_score and then evaluate accuracy on test data?

Question

Hi, suppose we evaluate the CV accuracy as follows:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

model = RandomForestClassifier(random_state=0)

k_folds = KFold(n_splits=5)
splits = k_folds.split(X_train, y_train)
cv_acc = cross_val_score(model, X_train, y_train, cv=splits, scoring='accuracy')

Is it common to then evaluate performance on the test set?

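from sklearn.metrics import accuracy_score

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# predict() already returns class labels; round() only works if they are numeric
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)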

Should there be any explicit step between computing cv_acc and computing accuracy? Which result would we report as the correct accuracy? I get about 92.5% in cv_acc and about 87.5% in accuracy.

Thanks :)

Answer 1

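The most common approach is to run cross-validation on the training set. The official documentation is here.

You run cross-validation on your training data, so you have several different training folds. Then you take the resulting parameters and test them (without running cross-validation on your test set; simply use the parameters that came out of cross-validation on the training data).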
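A minimal sketch of that two-step flow, in case it helps. The data here is synthetic (`make_classification` is my stand-in, not the asker's dataset), and the forest size is an arbitrary choice to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Step 1: cross-validation sees the training data only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_acc = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")

# Step 2: fit once on all training data, then score once on the test set.
model.fit(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"CV accuracy:   {cv_acc.mean():.3f} +/- {cv_acc.std():.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The CV figure tells you how stable the model is across folds; the test figure is the single held-out estimate. Some gap between the two is normal, since they are computed on different samples.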

Answer 2

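The goal of cross-validation is to check whether the model you plan to use (model + specific hyperparameters) generalizes. You can keep a separate test set aside for the final evaluation, as recommended here.

Use cross-validation only on the training data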

A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV.

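Here is the flow, with my comments on each stage of the diagram:

PARAMETERS: You have chosen a model and a range of hyperparameters you want to try, and you are trying to find out which model + parameter combination generalizes best.
CROSS-VALIDATION: You run cross-validation on each model + parameter combination and check the k-fold accuracy.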

scores = cross_val_score(clf, X, y, cv=5)

#THIS IS GOOD! MODEL IS GENERALIZABLE ON k-FOLDS
array([0.96..., 1.  ..., 0.96..., 0.96..., 1.        ])

#THIS IS BAD! MODEL IS NOT GENERALIZABLE
array([0.68..., 0.42..., 0.96..., 0.99..., 1.        ])
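One rough way to put numbers on "good" versus "bad" is to look at the mean and the spread of the fold scores. The values below are rounded from the two illustrative arrays above (an assumption, since the originals are truncated), and `summarize` is a hypothetical helper, not a scikit-learn function:

```python
import numpy as np

# Fold scores rounded from the two illustrative arrays (an assumption).
stable = np.array([0.96, 1.00, 0.96, 0.96, 1.00])
unstable = np.array([0.68, 0.42, 0.96, 0.99, 1.00])


def summarize(scores):
    """Return (mean, standard deviation) of the per-fold scores."""
    return float(scores.mean()), float(scores.std())


for name, scores in [("stable", stable), ("unstable", unstable)]:
    mean, std = summarize(scores)
    print(f"{name}: mean={mean:.3f}, std={std:.3f}")
```

A large standard deviation across folds is the warning sign here: the model's accuracy depends heavily on which slice of the data it was trained on.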

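BEST PARAMETERS: You can combine cross-validation with grid search to find the best parameters, meaning the ones that give you the most generalizable model.

A common confusion: do not mix up the best parameters with the parameters of one of the k-fold models. Every k-fold model uses the same model + parameters, just on a different fold split of the data. The best params are simply the hyperparameters you selected, either through grid search or manually.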
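To make that distinction concrete, here is a small sketch with `GridSearchCV`. The data is synthetic and the two-value grid is hypothetical, chosen only to keep the output short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (an assumption).
X_train, y_train = make_classification(
    n_samples=300, n_features=10, random_state=0)

# Hypothetical grid: two candidate hyperparameter settings.
param_grid = {"max_depth": [2, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Each candidate is fitted once per fold: 2 candidates x 5 folds.
results = search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, f"mean CV accuracy = {mean_score:.3f}")

# best_params_ is a dict of hyperparameters, not the fitted state
# of any single fold model.
print(search.best_params_)
```

The per-fold scores live in `cv_results_` under keys such as `split0_test_score`; the fold models themselves are discarded after scoring.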

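DATASET / TRAINING DATA / TEST DATA: Take the dataset and split it into train and test as you normally would (80/20 or so). The cross-validation above runs on the training part only.
RETRAIN MODEL: With the best parameters determined by grid search and cross-validation, retrain the model on the training dataset and score it on the test data.
FINAL EVALUATION: The final test accuracy (the one you should report) is the accuracy you get after scoring the best-parameter model on the test data.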
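Putting the stages together, a sketch of the whole flow could look like this. Again the data is synthetic and the grid values are hypothetical; treat it as one way to wire the steps up, not the only one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=500, n_features=12, random_state=1)

# Hold out the test set first (80/20); it is not used until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Grid search with 5-fold CV on the training data only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [25, 50], "max_depth": [3, None]},  # hypothetical grid
    cv=5,
    scoring="accuracy",
    refit=True,
)
search.fit(X_train, y_train)

# refit=True retrains the best candidate on all of X_train.
best_model = search.best_estimator_
final_acc = accuracy_score(y_test, best_model.predict(X_test))

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy of best candidate: {search.best_score_:.3f}")
print(f"Final test accuracy (the number to report): {final_acc:.3f}")
```

`best_score_` is the cross-validated estimate used for model selection, while `final_acc` comes from data the search never saw.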

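Think of grid search as an exploration of model parameters, and of cross-validation as an exploration of how well one specific set of model parameters generalizes on the given data through k-fold validation. Both processes help with model selection. Once you have picked the right model, you retrain it on the original training data and get the test accuracy from the test data.

Please read this link, as it explains the cross-validation workflow well.

In the words of the sklearn authors: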

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model, and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, the final evaluation can be done on the test set.

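What is cross-validation?

In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

A model is trained using k-1 of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.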
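The loop described above can be written out by hand. This sketch assumes synthetic data and a small random forest (both my choices), and checks that the hand-rolled average agrees with `cross_val_score` when both use the same splitter:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training set (an assumption).
X_train, y_train = make_classification(
    n_samples=200, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)
kf = KFold(n_splits=5)

fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    fold_model = clone(model)  # fresh, unfitted copy for this fold
    # Train on the k-1 folds selected by train_idx.
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    # Validate on the one remaining fold.
    fold_scores.append(
        fold_model.score(X_train[val_idx], y_train[val_idx]))

manual_mean = np.mean(fold_scores)
library_mean = cross_val_score(model, X_train, y_train, cv=kf).mean()
print(f"manual: {manual_mean:.3f}, cross_val_score: {library_mean:.3f}")
```

The fixed `random_state` on the forest is what makes the two results comparable; without it each fit would grow different trees.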

This image should summarize everything I discussed above.

python

classification

dataframe

pandas

scikit-learn