How to consistently standardize a sparse feature matrix in scikit-learn?

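I am using sklearn's DictVectorizer to build a large, sparse feature matrix that is fed to an ElasticNet model. Elastic net (and similar linear models) work best when the predictors (the columns of the feature matrix) are centered and scaled. The recommended approach is to build a Pipeline that uses a StandardScaler prior to the regressor; however, that doesn't work with sparse features, as stated in the docs.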

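I was thinking of using the normalize=True flag in ElasticNet, which seems to support sparse data, but it is not clear whether the normalization is also applied to the test data during prediction. Does anyone know whether normalize=True applies to prediction as well? If not, is there a way to use the same standardization on the training and test sets when dealing with sparse features?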
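
For reference, here is a minimal sketch of one way to get the same scaling on training and test data with sparse input. It assumes that scaling without centering is acceptable, since StandardScaler accepts sparse matrices when with_mean=False (centering is the part that would destroy sparsity). The toy records, targets, and alpha value are made up purely for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; real feature dicts would be far larger.
train_records = [
    {"color": "red", "size": 1.0},
    {"color": "blue", "size": 2.0},
    {"color": "red", "size": 3.0},
    {"color": "green", "size": 4.0},
]
y_train = [1.0, 2.0, 3.0, 4.0]
test_records = [{"color": "blue", "size": 2.5}]

model = make_pipeline(
    DictVectorizer(sparse=True),      # produces a scipy sparse matrix
    StandardScaler(with_mean=False),  # scales columns only, keeps sparsity
    ElasticNet(alpha=0.1),            # alpha chosen arbitrarily for the sketch
)

# The scaler learns its per-column scale from the training data only...
model.fit(train_records, y_train)

# ...and the pipeline reapplies that same scale to unseen data at predict time.
predictions = model.predict(test_records)
```

Because the scaler is a pipeline step, the statistics estimated during fit are reused during predict, so the two sets cannot end up on different scales.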

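Digging through the sklearn code, it looks like when fit_intercept=True and normalize=True, the coefficients estimated on the normalized data are projected back to the original scale of the data. This is similar to how glmnet in R handles standardization. The relevant code is the _set_intercept method of LinearModel, see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L158. Predictions on unseen data therefore use coefficients in the original scale, i.e. normalize=True is safe to use.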
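
The back-projection idea can be checked by hand. The sketch below is not sklearn's internal code: it assumes scale-only standardization via StandardScaler(with_mean=False) in place of the normalize flag (which recent scikit-learn releases no longer accept), and it uses random sparse data and an arbitrary alpha. It fits on scaled data, divides the coefficients by the per-column scale, and confirms that the rescaled coefficients give the same predictions on raw test data.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Hypothetical random sparse data standing in for the real feature matrix.
X_train = sparse.random(50, 5, density=0.3, format="csr", random_state=rng)
y_train = rng.normal(size=50)
X_test = sparse.random(10, 5, density=0.3, format="csr", random_state=rng)

# Fit the scaler on the training data only, then fit the model on scaled data.
scaler = StandardScaler(with_mean=False).fit(X_train)
model = ElasticNet(alpha=0.01).fit(scaler.transform(X_train), y_train)

# Project the coefficients back to the original scale:
# (X / s) @ w + b  ==  X @ (w / s) + b
coef_original = model.coef_ / scaler.scale_

pred_from_scaled = model.predict(scaler.transform(X_test))
pred_from_raw = X_test @ coef_original + model.intercept_

# Both routes should agree up to floating-point error.
assert np.allclose(pred_from_scaled, pred_from_raw)
```

Since no centering is involved here, the intercept carries over unchanged; with centering, the intercept would also need adjusting by the column means.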

sparse-matrix

scikit-learn