How to use scikit's preprocessing/normalization along with cross validation?

As an example of cross-validation without any preprocessing, I can do something like this:

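    from sklearn.linear_model import SGDClassifier
    # GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed)
    from sklearn.model_selection import GridSearchCV

    tuned_params = [{"penalty": ["l2", "l1"]}]
    sgd = SGDClassifier()
    # pass the estimator and parameter grid defined above
    clf = GridSearchCV(sgd, tuned_params, verbose=5)
    clf.fit(x_train, y_train)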

I would like to preprocess my data using something like

    from sklearn import preprocessing
    x_scaled = preprocessing.scale(x_train)

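But doing this before setting up the cross-validation is not a good idea, because the training and test sets would then be normalized together. How do I set up the cross-validation so that the corresponding training and test sets are preprocessed separately on each run?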

According to the documentation, if you use a Pipeline, this is done for you. From the docs, just above section 3.1.1.1, emphasis mine:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]

More relevant information on using pipelines is available here.
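
A minimal sketch of that approach, under a few assumptions that are not in the original post: `StandardScaler` is used as the transformer counterpart of `preprocessing.scale`, the step names `"scale"` and `"sgd"` are arbitrary labels, and the data is a synthetic stand-in for `x_train` / `y_train`. Because the scaler sits inside the pipeline, `GridSearchCV` refits it on the training portion of each fold and only applies it to the held-out portion.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train (assumption, only to make this runnable).
rng = np.random.RandomState(0)
x_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_train = (x_train[:, 0] + x_train[:, 1] > 10.0).astype(int)

# The scaler is a pipeline step, so it is fitted on each fold's training
# split only and then applied to that fold's held-out split.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)

best_penalty = clf.best_params_["sgd__penalty"]
# After the search, the best pipeline is refitted on all of x_train.
fitted_scaler = clf.best_estimator_.named_steps["scale"]
```

Calling `clf.predict(x_new)` afterwards runs new data through the same fitted scaler before classification, so no separate scaling call is needed.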