How to use scikit's preprocessing/normalization along with cross validation?

Question

As an example of cross-validation without any preprocessing, I can do something like this:

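    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search in old releases

    # Candidate values for the regularization penalty
    tuned_params = [{"penalty": ["l2", "l1"]}]
    SGD = SGDClassifier()
    # Pass the estimator and the parameter grid defined above
    clf = GridSearchCV(SGD, tuned_params, verbose=5)
    clf.fit(x_train, y_train)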

I would like to preprocess my data using something like

    from sklearn import preprocessing
    x_scaled = preprocessing.scale(x_train)

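But it would not be a good idea to do this before setting up the cross-validation, because then the training and test sets would be normalized together. How do I set up the cross-validation so that the corresponding training and test sets are preprocessed separately on each run?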

Answer 1

According to the documentation, if you use a Pipeline, this is done for you. From the docs, just above section 3.1.1.1, emphasis mine:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]

More information on pipelines is available in the scikit-learn documentation.
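
As a minimal sketch of what this could look like for the question's grid search (not necessarily the exact setup the asker needs): the scaler and the classifier are wrapped in a Pipeline, and the pipeline is passed to GridSearchCV, so the scaler is refit on the training portion of each fold only. The synthetic data from `make_classification`, the step names `"scale"` and `"sgd"`, and the choice of `StandardScaler` and `cv=5` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's x_train / y_train (assumption).
x_train, y_train = make_classification(
    n_samples=200, n_features=10, random_state=0
)

# The step names "scale" and "sgd" are arbitrary labels chosen here.
pipe = Pipeline([
    ("scale", StandardScaler()),             # fit on each fold's training part only
    ("sgd", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)

print(clf.best_params_)
```

Note that the grid keys change from `penalty` to `sgd__penalty`, because the classifier is now a named step inside the pipeline. Calling `clf.predict` on new data applies the scaling learned during the final refit before classifying.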

python

scikit-learn