How to use scikit's preprocessing/normalization along with cross validation?
As an example of cross-validation without any preprocessing, I can do something like this:
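# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in sklearn.model_selection
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
tuned_params = [{"penalty": ["l2", "l1"]}]
SGD = SGDClassifier()
# pass the estimator and the parameter grid defined above
clf = GridSearchCV(SGD, tuned_params, verbose=5)
clf.fit(x_train, y_train)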
I would like to preprocess my data using something like:
from sklearn import preprocessing
x_scaled = preprocessing.scale(x_train)
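But it would not be a good idea to do this before setting up the cross-validation, because then the training and test sets would be normalized together. How do I set up the cross-validation so that the corresponding training and test sets are preprocessed separately on each run?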
According to the documentation, if you employ a Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]
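To make the quoted principle concrete, here is a rough sketch of what "learnt from a training set and applied to held-out data" looks like when done by hand for each fold. The synthetic data from make_classification and the choice of StandardScaler (the estimator counterpart of preprocessing.scale) are assumptions for illustration, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train (an assumption for illustration).
x_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

fold_scores = []
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in splitter.split(x_train):
    # Learn the mean and standard deviation from the training fold only.
    scaler = StandardScaler().fit(x_train[train_idx])
    x_tr = scaler.transform(x_train[train_idx])
    # Apply the same, already-fitted scaling to the held-out fold.
    x_te = scaler.transform(x_train[test_idx])

    model = SGDClassifier(random_state=0).fit(x_tr, y_train[train_idx])
    fold_scores.append(model.score(x_te, y_train[test_idx]))

print(fold_scores)
```

A Pipeline does this fit-on-train, transform-on-test bookkeeping automatically inside each cross-validation split.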
More relevant information on using pipelines is available here.
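As a minimal sketch of how this could look for the grid search in the question: the step names "scale" and "sgd" are arbitrary labels chosen here, the data is a synthetic stand-in, and StandardScaler is swapped in for preprocessing.scale because a Pipeline step needs an object with fit and transform.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train (an assumption for illustration).
x_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is refit on the training part of every CV split,
# then applied to the matching held-out part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)
print(clf.best_params_)
```

Because the fitted clf.best_estimator_ is itself a pipeline, calling clf.predict on new data applies the scaling learnt from the training data before classifying.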