scikit-learn GridSearchCV does not work properly with random forest
I have a grid search implementation for a random forest model.
train_X, test_X, train_y, test_y = train_test_split(features, target, test_size=.10, random_state=0)
# A small performance gain can be obtained from standardization
train_X, test_X = standarize(train_X, test_X)
tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0]
}]
scores = ['neg_mean_squared_error', 'neg_mean_absolute_error']
for n_fold in [5]:
    for score in scores:
        print("# Tuning hyper-parameters for %s with %d-fold" % (score, n_fold))
        start_time = time.time()
        print()
        # TODO: RandomForestRegressor
        clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters, cv=n_fold,
                           scoring=score, verbose=2, n_jobs=-1)
        clf.fit(train_X, train_y)
... Rest omitted
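Before using it for this grid search, I had used exactly the same dataset for many other tasks, so there should be nothing wrong with the data. Also, for testing purposes, I first used LinearRegression to check that the whole pipeline runs smoothly, and it does. Then I switched to RandomForestRegressor with a very small number of estimators to test it again. That is when something very strange happened; the details are in the log below. The performance drop is dramatic and I have no idea what is going on. There is no reason for a grid search this small to take more than 30 minutes to run.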
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.0s remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.0s remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.1s remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.1s remaining: 0.0s
building tree 2 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.0s finished
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.3s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.3s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
[CV] criterion=mse, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.8s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
building tree 1 of 5
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.9s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.3s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 1.0s remaining: 0.0s
building tree 2 of 5
building tree 3 of 5
building tree 4 of 5
building tree 5 of 5
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 5.3s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.2s finished
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.5s finished
[CV] .... criterion=mse, n_estimators=5, random_state=0, total= 5.6s
[CV] criterion=mae, n_estimators=5, random_state=0 ...................
building tree 1 of 5
The log above is printed within a few seconds, and then it seems to get stuck here...
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.4min remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.5min remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.5min remaining: 0.0s
building tree 2 of 5
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 7.8min remaining: 0.0s
building tree 2 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 3 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 4 of 5
building tree 5 of 5
building tree 5 of 5
building tree 5 of 5
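These lines took more than 20 minutes.
By the way, with LinearRegression each GridSearchCV run takes less than 1 second.
Do you have any idea why the performance drops so much?
Any suggestions and comments are welcome. Thanks.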
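Try setting max_depth for the RandomForestRegressor. This should reduce the fitting time. The default is max_depth=None, so each tree keeps growing until its leaves are pure or contain fewer than min_samples_split samples.
For example: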
tuned_parameters = [{
    'n_estimators': [5],
    'criterion': ['mse', 'mae'],
    'random_state': [0],
    'max_depth': [4],
}]
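To see how much max_depth limits the work per tree, here is a minimal sketch on synthetic data. The dataset sizes and the fit_and_time and deepest helpers are made up for illustration, and the forest uses the default criterion so the snippet does not depend on the criterion names, which changed in later scikit-learn releases ('mse' and 'mae' became 'squared_error' and 'absolute_error'). Note that in the log from the question the slow fits appear to be the criterion=mae ones, so timings on real data may be dominated by that choice.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the asker's data (sizes are arbitrary assumptions).
X, y = make_regression(n_samples=2000, n_features=20, noise=0.1, random_state=0)


def fit_and_time(max_depth):
    """Fit a 5-tree forest and return (fitted model, seconds spent in fit)."""
    model = RandomForestRegressor(n_estimators=5, max_depth=max_depth, random_state=0)
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


def deepest(model):
    """Depth of the deepest tree in a fitted forest."""
    return max(tree.get_depth() for tree in model.estimators_)


unbounded, t_unbounded = fit_and_time(None)  # default: grow until leaves are pure
shallow, t_shallow = fit_and_time(4)         # capped at depth 4

print("max_depth=None: deepest tree %d, fit took %.3fs" % (deepest(unbounded), t_unbounded))
print("max_depth=4:    deepest tree %d, fit took %.3fs" % (deepest(shallow), t_shallow))
```

The exact timings depend on the machine and the data; the point is only that the capped forest builds far fewer nodes per tree.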
Edit: Also, RandomForestRegressor has n_jobs=1 by default. With this setting it builds one tree at a time. Try setting n_jobs=-1.
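A minimal sketch of setting n_jobs on the forest itself, again on made-up synthetic data. With a fixed random_state the trees should be the same whether they are built serially or in parallel, which the last line checks. One caveat that is an assumption about your setup rather than something from the question: if GridSearchCV is also run with n_jobs=-1, the two levels of parallelism may compete for the same cores, so it can be worth parallelizing only one of them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic dataset (arbitrary sizes, for illustration only).
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# n_jobs=1 builds the trees one after another (the default).
serial = RandomForestRegressor(n_estimators=20, random_state=0, n_jobs=1).fit(X, y)

# n_jobs=-1 asks for all available cores when building the trees.
parallel = RandomForestRegressor(n_estimators=20, random_state=0, n_jobs=-1).fit(X, y)

# Same seed, same forest: predictions agree up to floating-point rounding.
same = np.allclose(serial.predict(X), parallel.predict(X))
print("serial and parallel forests agree:", same)
```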
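Also, instead of looping over the scoring parameter of GridSearchCV, you can specify multiple metrics at once. When you do so, you must also specify the metric you want GridSearchCV to select on as the value of refit. After fitting, you can then access all of the scores in the cv_results_ dictionary.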
clf = GridSearchCV(RandomForestRegressor(verbose=2), tuned_parameters,
                   cv=n_fold, scoring=scores, refit='neg_mean_squared_error',
                   verbose=2, n_jobs=-1)
clf.fit(train_X, train_y)
results = clf.cv_results_
print(np.mean(results['mean_test_neg_mean_squared_error']))
print(np.mean(results['mean_test_neg_mean_absolute_error']))
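For a version of this that runs on its own, here is a small sketch with synthetic data. The data, the parameter grid, and cv=3 are assumptions chosen to keep it fast, not the asker's setup. It prints the per-candidate scores rather than their mean over candidates, which is usually what you want when comparing settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real training data (arbitrary sizes).
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=0)

# Hypothetical grid: two candidates that differ only in max_depth.
param_grid = {"n_estimators": [5], "max_depth": [2, 4], "random_state": [0]}
scores = ["neg_mean_squared_error", "neg_mean_absolute_error"]

search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    cv=3,
    scoring=scores,                  # evaluate both metrics in a single search
    refit="neg_mean_squared_error",  # metric used to pick best_params_
)
search.fit(X, y)

results = search.cv_results_
mse_scores = results["mean_test_neg_mean_squared_error"]
mae_scores = results["mean_test_neg_mean_absolute_error"]

# One row per candidate, with its cross-validated mean for each metric.
for params, mse, mae in zip(results["params"], mse_scores, mae_scores):
    print(params, "neg MSE: %.3f" % mse, "neg MAE: %.3f" % mae)

print("best by neg MSE:", search.best_params_)
```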