Hyperparameter tuning

Question

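I am currently working on a project of my own. For this project, I am trying to compare the results of several algorithms, but I want to make sure each algorithm I test is configured to give its best results.

So I use cross-validation to test every combination of parameters and pick the best one.

For example: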

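from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV, ShuffleSplit

def KMeanstest(param_grid, n_jobs):
    # X_train and y_train are assumed to be defined earlier in the project.
    estimator = KMeans()
    cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
    # Exhaustive search: every combination in param_grid is fitted once per split.
    regressor = GridSearchCV(estimator=estimator, cv=cv, param_grid=param_grid, n_jobs=n_jobs)
    regressor.fit(X_train, y_train)

    print("Best Estimator learned through GridSearch")
    print(regressor.best_estimator_)

    return cv, regressor.best_estimator_

# Note: newer scikit-learn releases no longer accept 'precompute_distances' or
# 'n_jobs' for KMeans, and name the algorithm options 'lloyd' and 'elkan';
# drop or adjust those entries when running on such a release.
param_grid = {'n_clusters': [2],
              'init': ['k-means++', 'random'],
              'max_iter': [100, 200, 300, 400, 500],
              'n_init': [8, 9, 10, 11, 12, 13, 14, 15, 16],
              'tol': [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
              'precompute_distances': ['auto', True, False],
              'random_state': [42],
              'copy_x': [True, False],
              'n_jobs': [-1],
              'algorithm': ['auto', 'full', 'elkan']
             }

n_jobs = -1

cv, best_est = KMeanstest(param_grid, n_jobs)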

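But this is very time-consuming. I would like to know whether this approach is the best one, or whether I should use something else.

Thanks for your help.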

Answer 1

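You can try random search instead of grid search. Random search is a technique that uses random combinations of hyperparameter values to find the best configuration for the model: rather than trying every combination, it evaluates the objective at a number of randomly sampled configurations in the parameter space.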

You can find details on the sklearn documentation page, which also gives a comparison between random search and grid search.
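As a rough illustration of random search, here is a minimal sketch using scikit-learn's RandomizedSearchCV on KMeans. The synthetic data from make_blobs, the sampled ranges, and the budget of 20 candidates are assumptions chosen for the example, not values taken from the question.

```python
from scipy.stats import loguniform, randint
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

# Synthetic stand-in for the asker's data (an assumption for this sketch).
X, _ = make_blobs(n_samples=300, centers=2, random_state=42)

# Distributions to sample from, instead of fixed lists of values.
param_distributions = {
    "init": ["k-means++", "random"],
    "max_iter": randint(100, 501),   # integers in [100, 500]
    "n_init": randint(8, 17),        # integers in [8, 16]
    "tol": loguniform(1e-6, 1e-1),   # log-uniform between 1e-6 and 1e-1
}

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    estimator=KMeans(n_clusters=2, random_state=42),
    param_distributions=param_distributions,
    n_iter=20,        # only 20 sampled candidates are evaluated
    cv=cv,
    random_state=42,
)
# No scoring is given, so KMeans.score (negative inertia) is used.
search.fit(X)

print(search.best_params_)
```

The n_iter argument fixes the budget up front, so the run time no longer grows with the number of values listed per parameter.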

Hope you find this useful.

Answer 2

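Besides random search and grid search, there are tools and libraries for smarter hyperparameter tuning. I have used Optuna successfully, but there are others.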

Answer 3

The problem with GridSearch is that it is very time-consuming, as you said. RandomSearch is sometimes a good alternative, but it is not optimal.
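To see why the grid in the question takes so long, here is a small sketch that counts the candidates with scikit-learn's ParameterGrid. The grid and the 10 splits are copied from the question; nothing is fitted, so it runs instantly.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the question; ParameterGrid only enumerates combinations,
# it does not validate the parameter names against KMeans.
param_grid = {
    "n_clusters": [2],
    "init": ["k-means++", "random"],
    "max_iter": [100, 200, 300, 400, 500],
    "n_init": [8, 9, 10, 11, 12, 13, 14, 15, 16],
    "tol": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
    "precompute_distances": ["auto", True, False],
    "random_state": [42],
    "copy_x": [True, False],
    "n_jobs": [-1],
    "algorithm": ["auto", "full", "elkan"],
}

n_candidates = len(ParameterGrid(param_grid))  # product of the list lengths
n_splits = 10                                  # ShuffleSplit(n_splits=10)
total_fits = n_candidates * n_splits

print(n_candidates, total_fits)  # 9720 candidates, 97200 fits
```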

Bayesian optimization is another option. It uses a probabilistic approach to home in on a good set of parameters quickly. I have tried the hyperopt library in Python myself and it works really well. Check out this tutorial for more information. You can also download the associated notebook from my GitHub.

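The upside is that, since you have already tried GridSearch, you have a rough idea of which parameter ranges do not work well. So you can define a narrower search space for the Bayesian optimization to run on, which cuts the time even further. In addition, hyperopt can be used to compare several algorithms together with their respective parameters.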


python

machine-learning

scikit-learn

cross-validation

hyperparameters