Explicitly specifying test/train sets in GridSearchCV
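I have a question about the cv parameter of sklearn's GridSearchCV.
I'm working with data that has a time component, so random shuffling within KFold cross-validation doesn't seem sensible.
Instead, I want to explicitly specify cutoffs for training, validation, and test data within GridSearchCV. Can I do this?
To better illustrate the question, here is how I would do it manually.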
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

np.random.seed(444)
index = pd.date_range('2014', periods=60, freq='M')
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X = pd.DataFrame(X, index=index, columns=list('abc'))
y = pd.Series(y, index=index, name='y')

# Train on the first 35 samples, validate on the next 15, test on
# the final 10.
X_train, X_val, X_test = np.array_split(X, [35, 50])
y_train, y_val, y_test = np.array_split(y, [35, 50])

param_grid = {'alpha': np.linspace(0, 1, 11)}
model = None
best_param_ = None
best_score_ = -np.inf

# Manual implementation
for alpha in param_grid['alpha']:
    ridge = Ridge(random_state=444, alpha=alpha).fit(X_train, y_train)
    score = ridge.score(X_val, y_val)
    if score > best_score_:
        best_score_ = score
        best_param_ = alpha
        model = ridge

print('Optimal alpha parameter: {:0.2f}'.format(best_param_))
print('Best score (on validation data): {:0.2f}'.format(best_score_))
print('Test set score: {:.2f}'.format(model.score(X_test, y_test)))
# Optimal alpha parameter: 1.00
# Best score (on validation data): 0.64
# Test set score: 0.22
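The process here is:
- For both X and y, I want a training set, a validation set, and a test set. The training set is the first 35 samples in the time series. The validation set is the next 15 samples. The test set is the final 10.
- The training and validation sets are used to determine the optimal alpha parameter for Ridge regression. Here I test alphas of (0.0, 0.1, ..., 0.9, 1.0).
- The test set is held out as unseen data for the "actual" test.
Anyway... it seems like I'm looking to do something like this, but I'm not sure what to pass to cv here: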
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv= ???)
grid_search.fit(...?)
The docs, which I'm having trouble interpreting, specify:
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs
for cv are:
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a (Stratified)KFold,
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
For integer/None inputs, if the estimator is a classifier and y is
either binary or multiclass, StratifiedKFold is used. In all other
cases, KFold is used.
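To illustrate the last option in the quoted docs (an iterable yielding train, test splits), a single fixed split can be written as a list containing one pair of index arrays. This is only a minimal sketch on plain NumPy arrays. The names X_dev, y_dev, and custom_cv are introduced here for illustration and do not appear in the original post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)

# Rows 0..49 go to the search; rows 50..59 stay untouched as the test set.
X_dev, y_dev = X[:50], y[:50]
X_test, y_test = X[50:], y[50:]

# One (train_indices, validation_indices) pair: fit on rows 0..34,
# score on rows 35..49. Indices are positions within X_dev.
custom_cv = [(np.arange(0, 35), np.arange(35, 50))]

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(), param_grid, cv=custom_cv)
grid_search.fit(X_dev, y_dev)

print(grid_search.best_params_)
print(grid_search.n_splits_)  # number of splits actually used
```

A list (rather than a generator) is used so the same split can be reused across several searches.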
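As @MaxU said, it's better to let GridSearchCV handle the splits, but if you want to enforce the split exactly as you set it up in the question, you can use PredefinedSplit, which does this very thing.
So you need to make the following changes to your code.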
# Here X_test, y_test is the untouched data.
# The validation data (X_val, y_val) is currently inside X_train and will be
# split off by PredefinedSplit inside GridSearchCV.
X_train, X_test = np.array_split(X, [50])
y_train, y_test = np.array_split(y, [50])

# Samples whose test_fold value is -1 are always kept in the training set.
train_indices = np.full((35,), -1, dtype=int)
# Samples with a zero or positive value are placed in the test fold with that index.
test_indices = np.full((15,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
print(test_fold)
# OUTPUT: 35 entries of -1 followed by 15 entries of 0

from sklearn.model_selection import PredefinedSplit
ps = PredefinedSplit(test_fold)

# Check how many splits will be done, based on test_fold
ps.get_n_splits()
# OUTPUT: 1

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
# OUTPUT: TRAIN holds indices 0 through 34, TEST holds indices 35 through 49

# And now, send this `ps` to the cv param in GridSearchCV
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)

# Here, send the X_train and y_train
grid_search.fit(X_train, y_train)
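The X_train and y_train sent to fit() will be split into train and test (val in your case) using the split we defined, so Ridge will be trained on the original data at indices [0:35] and tested on [35:50].
Hope this clears up how it works.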
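One detail worth checking when comparing this with the manual loop in the question: with the default refit=True, GridSearchCV refits the best estimator on everything passed to fit() (all 50 rows here), whereas the manual loop keeps the model trained on the first 35 rows only. The sketch below, written on plain NumPy arrays, shows both variants. The name manual is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X_train, X_test = X[:50], X[50:]
y_train, y_test = y[:50], y[50:]

# -1 = always in training, 0 = member of the single validation fold
test_fold = np.r_[np.full(35, -1), np.zeros(15, dtype=int)]
ps = PredefinedSplit(test_fold)

param_grid = {'alpha': np.linspace(0, 1, 11)}
grid_search = GridSearchCV(Ridge(), param_grid, cv=ps, refit=True)
grid_search.fit(X_train, y_train)

# best_score_ is the validation score on rows 35..49
print(grid_search.best_params_, grid_search.best_score_)

# best_estimator_ was refit on all 50 rows before this test-set score
print(grid_search.score(X_test, y_test))

# To mirror the manual loop, fit on the first 35 rows only
manual = Ridge(**grid_search.best_params_).fit(X_train[:35], y_train[:35])
print(manual.score(X_test, y_test))
```

Which variant is preferable depends on whether the validation rows should also inform the final model. Passing refit=False skips the refit entirely.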
Have you tried TimeSeriesSplit?
It is made explicitly for splitting time series data.
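from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# clf, param_grid and X are your own estimator, parameter grid and features
tscv = TimeSeriesSplit(n_splits=3)
grid_search = GridSearchCV(clf, param_grid, cv=tscv.split(X))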
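To see what TimeSeriesSplit actually produces, it can help to print the index ranges of each split. The sketch below assumes 60 rows, matching the question's data, and uses a dummy feature array, since only the number of rows matters for the split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)  # placeholder features, 60 rows

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
# train: 0..14  test: 15..29
# train: 0..29  test: 30..44
# train: 0..44  test: 45..59
```

Each training window grows while the test fold always lies after it. Note that this differs from the single fixed 35/15 split in the question, and that passing all 60 rows leaves no untouched test set. Slicing off the last 10 rows before the search would keep one.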
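For time series data, KFold is not the right approach: its folds ignore temporal order (and the rows are shuffled when shuffle=True), so the model can be trained on observations that come after the ones it is tested on, and you lose the pattern within the series. Here is one approach: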
import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])

model = xgb.XGBRegressor()
param_search = {'max_depth': [3, 5]}

# Two expanding-window splits; each test fold comes after its training rows
my_cv = TimeSeriesSplit(n_splits=2).split(X)
gsearch = GridSearchCV(estimator=model, cv=my_cv,
                       param_grid=param_search)
gsearch.fit(X, y)
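A small caveat about the snippet above: .split(X) returns a generator, which can be iterated only once. A possible alternative is to pass the TimeSeriesSplit object itself as cv, which can be reused. The sketch below swaps in scikit-learn's DecisionTreeRegressor for XGBRegressor purely to avoid the extra dependency. That substitution is an assumption for illustration and not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])

tscv = TimeSeriesSplit(n_splits=2)

# A generator is exhausted after one pass
gen = tscv.split(X)
first_pass = list(gen)
second_pass = list(gen)
print(len(first_pass), len(second_pass))  # 2 0

# Passing the splitter object lets every search regenerate the splits
param_search = {'max_depth': [3, 5]}
gsearch = GridSearchCV(DecisionTreeRegressor(random_state=0),
                       param_search, cv=tscv)
gsearch.fit(X, y)
print(gsearch.cv_results_['mean_test_score'])
```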
Reference -