Optimal Hyper-parameter Tuning for Tree Based Models
I am trying to build five machine-learning models and tune them with GridSearchCV so that each model is tuned optimally and I can use them to predict the new data that comes in every day. The problem is that this takes far too long. So my question is: what level of parameter tuning is absolutely necessary, but will finish in under two hours? Below is my tuning code and the classifiers used:
#Imports assumed by the snippet below
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, precision_score, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier

#Training and Test Sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20,
                                                    random_state=10)

#Classifiers
dtc = DecisionTreeClassifier()
randf = RandomForestClassifier()
bag = BaggingClassifier()
gradb = GradientBoostingClassifier()
knn = KNeighborsClassifier()  #defined but not tuned below
ada = AdaBoostClassifier()

#Hyperparameter Tuning for the Models being used
#Scoring Criteria
scoring = {'precision': make_scorer(precision_score),
           'accuracy': make_scorer(accuracy_score)}

#Grid Search for the Decision Tree
#Note: float ranges need an explicit step; with the default step of 1,
#np.arange(.05, .2) contains only the single value 0.05
param_dtc = {'min_samples_split': np.arange(2, 10),
             'min_samples_leaf': np.arange(0.05, 0.2, 0.05),
             'max_leaf_nodes': np.arange(2, 30)}
cv_dtc = GridSearchCV(estimator=dtc, param_grid=param_dtc, cv=3,
                      scoring=scoring, refit='precision', n_jobs=-1)

#Grid Search for the Random Forest Model
#bootstrap expects booleans; the strings 'True' and 'False' are both truthy
param_randf = {'n_estimators': np.arange(10, 20),
               'min_samples_split': np.arange(2, 10),
               'min_samples_leaf': np.arange(0.15, 0.33, 0.05),
               'max_leaf_nodes': np.arange(2, 30),
               'bootstrap': [True, False]}
cv_randf = GridSearchCV(estimator=randf, param_grid=param_randf, cv=3,
                        scoring=scoring, refit='precision', n_jobs=-1)

#Grid Search for the Bagging Model
param_bag = {'n_estimators': np.arange(10, 30),
             'max_samples': np.arange(2, 30),
             'bootstrap': [True, False],
             'bootstrap_features': [True, False]}
cv_bag = GridSearchCV(estimator=bag, param_grid=param_bag, cv=3,
                      scoring=scoring, refit='precision', n_jobs=-1)

#Grid Search for the Gradient Boosting Model
param_gradb = {'loss': ['deviance', 'exponential'],
               'learning_rate': np.arange(0.05, 0.1, 0.01),
               'max_depth': np.arange(2, 10),
               'min_samples_split': np.arange(2, 10),
               'min_samples_leaf': np.arange(0.15, 0.33, 0.05),
               'max_leaf_nodes': np.arange(2, 30)}
cv_gradb = GridSearchCV(estimator=gradb, param_grid=param_gradb, cv=3,
                        scoring=scoring, refit='precision', n_jobs=-1)

#Grid Search for the Adaptive Boosting Model
param_ada = {'n_estimators': np.arange(10, 30),
             'learning_rate': np.arange(0.05, 0.1, 0.01)}
cv_ada = GridSearchCV(estimator=ada, param_grid=param_ada, cv=3,
                      scoring=scoring, refit='precision', n_jobs=-1)

#Fit each grid search and keep the fitted searches in one dict
train_dict = {'dtc': cv_dtc.fit(x_train, y_train),
              'randf': cv_randf.fit(x_train, y_train),
              'bag': cv_bag.fit(x_train, y_train),
              'gradb': cv_gradb.fit(x_train, y_train),
              'ada': cv_ada.fit(x_train, y_train)}
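For scale, GridSearchCV performs one fit per candidate grid point per CV fold (plus a final refit), so the size of each search above can be counted before running anything with sklearn's ParameterGrid:

from sklearn.model_selection import ParameterGrid

#Each grid point is fit cv=3 times, so e.g. the gradient-boosting
#search alone runs roughly len(ParameterGrid(param_gradb)) * 3 fits
for name, grid in [('dtc', param_dtc), ('randf', param_randf),
                   ('bag', param_bag), ('gradb', param_gradb),
                   ('ada', param_ada)]:
    print(name, len(ParameterGrid(grid)) * 3, 'fits')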
You could consider doing an iterative grid search. For example, instead of setting 'n_estimators' to np.arange(10, 30), set it to [10, 15, 20, 25, 30]. If the optimal value turns out to be 15, continue with [11, 13, 15, 17, 19]. You can find a way to automate this process, and it will save you a lot of time.
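A minimal sketch of automating that coarse-to-fine loop, assuming a single integer parameter and default scoring; the helper name refine_grid_search and the step-halving schedule are illustrative choices, not a fixed recipe:

import numpy as np
from sklearn.model_selection import GridSearchCV

def refine_grid_search(estimator, x_train, y_train, param_name,
                       low, high, step, n_rounds=3, cv=3):
    #Hypothetical helper: run a coarse grid, then re-center a finer
    #grid on the best value and halve the step each round
    best, search = None, None
    for _ in range(n_rounds):
        values = np.arange(low, high + 1, step)
        search = GridSearchCV(estimator, {param_name: values},
                              cv=cv, n_jobs=-1)
        search.fit(x_train, y_train)
        best = search.best_params_[param_name]
        step = max(step // 2, 1)                 #finer spacing each round
        low, high = max(best - 2 * step, 1), best + 2 * step
    return best, search

#e.g. the answer's example: [10, 15, 20, 25, 30], then [11, 13, 15, 17, 19]
#best_n, search = refine_grid_search(RandomForestClassifier(), x_train,
#                                    y_train, 'n_estimators', 10, 30, 5)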
Also, play around with your data. You are tuning a lot of hyperparameters, and the effects of 'min_samples_split', 'min_samples_leaf', and 'max_leaf_nodes' overlap in a decision tree. It may not be necessary to define all of them.
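For instance, one way to trim the decision-tree search is to keep max_leaf_nodes as the single complexity control and drop the overlapping min_samples_* parameters; which parameter to keep is a judgment call, not prescribed by the answer:

#Trimmed grid: max_leaf_nodes alone already bounds tree complexity,
#reducing the decision-tree search to 28 grid points
param_dtc_small = {'max_leaf_nodes': np.arange(2, 30)}
cv_dtc_small = GridSearchCV(estimator=dtc, param_grid=param_dtc_small,
                            cv=3, scoring=scoring, refit='precision',
                            n_jobs=-1)
cv_dtc_small.fit(x_train, y_train)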