Should cross-validation scoring be performed on the original data or on the split data?

Question

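When I want to evaluate my model with cross-validation, should I run cross-validation on the original data (the data that has not been split into training and test sets) or on the training/test data?

I know that the training data is used to fit the model and the test data is used to evaluate it. If I use cross-validation, should I still split the data into training and test sets?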

features = df.iloc[:,4:-1]
results = df.iloc[:,-1]

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

clf = LogisticRegression()
model = clf.fit(x_train, y_train)

accuracy_test = cross_val_score(clf, x_test, y_test, cv = 5)

Or should I do this instead:

features = df.iloc[:,4:-1]
results = df.iloc[:,-1]

clf = LogisticRegression()
model = clf.fit(features, results)

accuracy_test = cross_val_score(clf, features, results, cv = 5)

Or perhaps something else entirely?

Answer 1

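I'll try to summarize the "best practice" here:

1) If you want to train a model, tune its parameters, and do a final evaluation, I suggest splitting your data into training | val | test.

You fit the model on the training part and then check different parameter combinations on the val part. Finally, once you have determined which classifier/parameter combination gives the best result on the val part, you evaluate it on the test part to get the final result.

Once you have evaluated on the test part, you should not change the parameters any more.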
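
A minimal sketch of option 1, to make the three-way split concrete. The synthetic data from make_classification, the 60/20/20 proportions, and the grid of C values are all assumptions for illustration, not something taken from the question:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the real features/results (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# split off the test part first (20%), then split the rest into training and val
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# try a few values of C (hypothetical grid) and keep the one that scores best on val
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# evaluate exactly once on the test part, then stop tuning
final_clf = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_clf.predict(X_test))
print(best_C, round(test_acc, 3))
```

The test part is touched only on the last lines, after the choice of C is already fixed.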

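2) On the other hand, some people take a different approach: they split the data into training and test, run cross-validation on the training part to tune the model, and at the end evaluate it on the test part.

If your dataset is fairly large, I recommend the first approach; if it is small, go with the second.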
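
Option 2 could look roughly like the sketch below. Again, the synthetic data and the candidate C values are assumptions made only for the example; the point is that cross-validation runs on the training part alone:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# synthetic stand-in for the real features/results (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# cross-validate each candidate C (hypothetical grid) on the training part only
cv_means = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X_train, y_train, cv=5)
    cv_means[C] = scores.mean()

# pick the C with the best mean CV accuracy, refit on all training data
best_C = max(cv_means, key=cv_means.get)
final_clf = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)

# single final evaluation on the held-out test part
test_acc = final_clf.score(X_test, y_test)
print(best_C, round(test_acc, 3))
```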

Answer 2

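Both of your approaches are wrong.

In the first one, you apply cross-validation to the test set, which makes no sense.
In the second one, you first fit the model on your entire dataset and then perform cross-validation, which also makes no sense. Moreover, that fit is redundant: cross_val_score does not use your fitted clf; it does its own fitting.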
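
To see that cross_val_score does its own fitting, here is a small check on synthetic data (the dataset is an assumption for the example). The estimator passed in is cloned for each fold, so the original object stays unfitted:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the real features/results (assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression()  # note: never fitted by us
scores = cross_val_score(clf, X, y, cv=5)

# a fitted LogisticRegression has a coef_ attribute; clf still does not
was_fitted = hasattr(clf, "coef_")
print(len(scores), was_fitted)
```

So calling clf.fit before cross_val_score changes nothing about the scores you get back.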

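Since you are not doing any hyperparameter tuning (i.e. you seem to be interested only in performance evaluation), there are two options:

Either use a separate test set,
or use cross-validation.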

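The first option (test set):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# hold out 30% of the data as a test set
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

# fit on the training part only
clf = LogisticRegression()
model = clf.fit(x_train, y_train)

# predict on the held-out test part
y_pred = clf.predict(x_test)

accuracy_test = accuracy_score(y_test, y_pred)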

The second option (cross-validation):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

clf = LogisticRegression()

# shuffle data first:
features_s, results_s = shuffle(features, results)
# returns one accuracy value per fold (5 here)
accuracy_cv = cross_val_score(clf, features_s, results_s, cv = 5, scoring='accuracy')

# fit the model afterwards with the whole data, if satisfied with the performance:
model = clf.fit(features, results)
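
As a possible variation (not what the snippet above does), the shuffling can be handed to the splitter instead of calling shuffle beforehand, and the per-fold scores can be summarized by their mean and standard deviation. The synthetic data and the random_state value are assumptions for the example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for the real features/results (assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# a shuffling, stratified 5-fold splitter replaces the separate shuffle() call
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
accuracy_cv = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring='accuracy')

# summarize the five fold accuracies
mean_acc = accuracy_cv.mean()
std_acc = accuracy_cv.std()
print(f"accuracy: {mean_acc:.2f} +/- {std_acc:.2f}")
```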
