Why apply cross-validation before training a model?

So, I have a hard time understanding why it is common practice to run a cross-validation step on a model that has not yet been trained. An example of what I am talking about can be found here. A snippet of the code is pasted below:

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# prepare the cross-validation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Questions:

  1. What is the purpose of cross-validation at this point?
  2. Does any part of this code perform any training?
  3. How does RepeatedKFold contribute to tackling an imbalanced dataset (let's assume that this is the case)?

Thanks in advance!

cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

According to the documentation, "cross_val_score" fits the model using the given cross-validation technique. In the code above, "model" holds the model that will be fitted, "cv" holds the cross-validation strategy, and "cross_val_score" uses that strategy to build the training and CV sets and to evaluate the model.

In other words, those lines are just definitions; the actual training and cross-validation happen inside the "cross_val_score" function.
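Conceptually, "cross_val_score" runs a loop like the one below. This is a simplified sketch of the idea, not scikit-learn's actual implementation: the untrained estimator is cloned for every fold, the clone is fit on the training part of that fold, and the held-out part is scored.

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
model = LogisticRegression()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model)                    # a fresh, untrained copy for this fold
    fold_model.fit(X[train_idx], y[train_idx])   # <-- the training happens here
    scores.append(fold_model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold

So the model passed to "cross_val_score" is deliberately untrained: it is a template that gets trained from scratch 30 times (10 folds x 3 repeats) during the evaluation.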

How does RepeatedKFold contribute to tackling an imbalanced dataset (let's assume that this is the case)?

KFold CV does not, in general, handle an imbalanced dataset; it only ensures that the results are not biased by the particular choice of the training/CV splits:

Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. This involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
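To make the standard-error point concrete, here is a minimal sketch (the 1/3/10 repeat counts are arbitrary choices of mine, not from the original code): more repeats yield more fold scores, so the standard error of the mean accuracy shrinks, even though each individual fold score is just as noisy.

from numpy import mean, std, sqrt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
model = LogisticRegression()

for repeats in (1, 3, 10):
    cv = RepeatedKFold(n_splits=10, n_repeats=repeats, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # the standard error of the mean shrinks roughly as 1/sqrt(number of fold scores)
    sem = std(scores) / sqrt(len(scores))
    print('repeats=%2d  mean=%.3f  standard error=%.4f' % (repeats, mean(scores), sem))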

If you want to tackle an imbalanced dataset, you have to use a metric better suited than accuracy, such as 'balanced_accuracy' or 'roc_auc', and make sure that both the training and the CV sets contain positive as well as negative examples.
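As an illustration, here is a sketch on a deliberately imbalanced variant of the same synthetic dataset. The weights=[0.95] parameter and the use of RepeatedStratifiedKFold are my additions, not part of the question's code: plain accuracy is inflated by the majority class, while 'balanced_accuracy' and 'roc_auc' reflect how the model actually handles the minority class.

from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# a deliberately imbalanced dataset: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, weights=[0.95], random_state=1)
model = LogisticRegression()

# stratified splits preserve the class ratio in every fold, so every
# training and CV set contains examples of both classes
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

for metric in ('accuracy', 'balanced_accuracy', 'roc_auc'):
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    print('%-17s %.3f (%.3f)' % (metric, mean(scores), std(scores)))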