GridSearchCV and cross_val_score give different results for a decision tree

Question

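When I compare best_score_ from GridSearchCV with the score I get by passing GridSearchCV's best_params_ to cross_val_score, the two results differ. This only happens with decision trees and random forests; with SVM, KNN, and LR the results are the same.
Here is the code I am using: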

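import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def dtree_param_selection(X, y):
    # dictionary of all hyperparameter values to test
    # note: 'auto' for max_features was deprecated and later removed in newer scikit-learn releases
    param_grid = {
        'criterion': ['gini', 'entropy'],
        'max_features': ["auto", "sqrt", "log2"],
        'max_depth': np.arange(2, 20),
    }
    # decision tree model (random_state is left at its default, None)
    dtree_model = DecisionTreeClassifier()
    # grid search over all combinations with 10-fold cross-validation
    dtree_gscv = GridSearchCV(dtree_model, param_grid, cv=10)
    # fit the search to the data
    dtree_gscv.fit(X, y)
    print(dtree_gscv.best_score_)
    return dtree_gscv.best_params_

dtree_param_selection(good_feature, label)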

cross_val_score:

# best_params_ is a dict, so it has to be unpacked into keyword arguments
clf = DecisionTreeClassifier(**dtree_gscv.best_params_)
acc = cross_val_score(clf, good_feature, label, cv=10)

Answer 1

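For tree-based models you should set the random_state parameter before training. It defaults to None, and because the tree randomly permutes (and, with max_features, randomly samples) the candidate features at each split, every fit can then produce a different tree. Using the same fixed random_state in the estimator given to GridSearchCV and in the one given to cross_val_score makes the results match.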

From the documentation:

random_state int or RandomState, default=None

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random
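As a rough illustration of this answer, the sketch below fixes random_state in both places and compares the two scores. The synthetic dataset from make_classification, the reduced parameter grid, and the names SEED, search, and scores are assumptions made for the example; they are not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for good_feature / label (assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Smaller grid than in the question, to keep the example fast
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "max_depth": list(range(2, 6)),
}

SEED = 0  # any fixed integer works, as long as it is the same in both places

# The seed is set on the estimator that GridSearchCV clones for every fit
search = GridSearchCV(DecisionTreeClassifier(random_state=SEED), param_grid, cv=10)
search.fit(X, y)

# The same seed is set again on the estimator passed to cross_val_score
clf = DecisionTreeClassifier(random_state=SEED, **search.best_params_)
scores = cross_val_score(clf, X, y, cv=10)

print(search.best_score_, scores.mean())
```

Both calls use the same unshuffled stratified 10-fold split for a classifier when cv=10, so once the seed is fixed the mean of the cross_val_score results should agree with best_score_.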

Answer 2

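The problem is probably that the tree models used by GridSearchCV and cross_val_score are created with different random seeds. If that is the case, you should be able to fix it by setting the random state explicitly. If you want to create clf from GridSearchCV.best_params_, include random_state in the parameter grid, i.e.: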

...
param_grid = { 'random_state': [0], ... }
...
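A minimal sketch of this approach follows, assuming a synthetic dataset and a reduced grid (both are placeholders, not the asker's data). Because random_state is a grid entry, it is returned inside best_params_ and is carried over automatically when the parameters are unpacked into a new classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for good_feature / label (assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "random_state": [0],  # single-value entry: only fixes the seed, adds no extra candidates
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "max_depth": list(range(2, 6)),
}

search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10)
search.fit(X, y)

# best_params_ now contains random_state, so the rebuilt model is seeded too
clf = DecisionTreeClassifier(**search.best_params_)
scores = cross_val_score(clf, X, y, cv=10)

print(search.best_params_)
print(search.best_score_, scores.mean())
```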

Another way to address this is to pass GridSearchCV's best model directly to the cross-validation function, which ensures that you do not miss any hyperparameters:

acc = cross_val_score(dtree_gscv.best_estimator_, good_feature, label, cv=10)
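The sketch below shows this variant end to end, using the best_estimator_ attribute that GridSearchCV exposes when refit=True (the default). The dataset and grid are placeholders chosen for the example. Note the assumption that the base estimator is seeded: cross_val_score clones the estimator with its parameters, so an unseeded best estimator would still be refit with random_state=None and could give a different score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for good_feature / label (assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(2, 6)),
}

# Seed set once on the base estimator; it is not part of the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=10)
search.fit(X, y)

# best_estimator_ keeps every parameter, including ones outside the grid
best = search.best_estimator_
print(best.get_params()["random_state"])

# cross_val_score clones best and refits the clone on each fold
scores = cross_val_score(best, X, y, cv=10)
print(search.best_score_, scores.mean())
```

Compared with rebuilding the classifier from best_params_, this avoids silently dropping settings such as random_state that were fixed on the base estimator rather than searched over.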


python

classification