How to execute both parallel and serial transformations with sklearn pipeline?

Question

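I want to use sklearn's pipeline to perform some preprocessing like the one shown in this diagram.

If I leave out the standardization step, I can do this without any problem. But I don't understand how to indicate that the output of the imputation step should flow into the standardization step.

Here is the current code, without the standardization step: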

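from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# NumericImputation is my own transformer; dq holds the column lists.
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_imputation", NumericImputation(), dq.numeric_variables),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), dq.categorical_variables),
    ],
    remainder="passthrough",
)

bp2 = make_pipeline(preprocessor, ElasticNet())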

Answer 1

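The point is that ColumnTransformer applies its transformers in parallel to the dataset you pass to it. So if you add a transformer that standardizes the numeric data as a second entry in the transformers list, it will not be applied to the output of the imputation but to the initial dataset.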
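To make the parallel behaviour concrete, here is a minimal sketch. It assumes sklearn's built-in SimpleImputer and StandardScaler as stand-ins for the custom transformers in the question, and a made-up one-column DataFrame. Listing both transformers side by side for the same column yields two separate output columns, and the scaled one still carries the missing value because the scaler never saw the imputed data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): a single numeric column with one missing value.
X = pd.DataFrame({"x": [1.0, 2.0, np.nan, 3.0]})

# Both transformers are listed side by side for the same column.
side_by_side = ColumnTransformer([
    ("impute", SimpleImputer(strategy="mean"), ["x"]),
    ("scale", StandardScaler(), ["x"]),
])
out = side_by_side.fit_transform(X)

# One output column per transformer, each computed from the original input.
print(out.shape)
# The imputed column has no NaN; the scaled column still has it, because
# StandardScaler received the raw column, not the imputer's output.
print(np.isnan(out[:, 0]).any(), np.isnan(out[:, 1]).any())
```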

One way to solve this kind of problem is to wrap the transformations on the numeric columns in a Pipeline.

from sklearn.pipeline import Pipeline

# YourStandardizer is a placeholder for whichever scaler you use.
preprocessor = ColumnTransformer([
    ('num_pipe', Pipeline([('numeric_imputation', NumericImputation()),
                           ('standardizer', YourStandardizer())]), dq.numeric_variables),
    ('onehot', OneHotEncoder(handle_unknown="ignore"), dq.categorical_variables)],
    remainder='passthrough')
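For a version that can be run end to end, the sketch below swaps in sklearn's SimpleImputer and StandardScaler for NumericImputation and YourStandardizer, and uses a small made-up DataFrame in place of the asker's data. The column lists, the toy values, and `sparse_threshold=0` (used only to force a dense array for easy inspection) are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the asker's dataset.
X = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "income": [1.0, 2.0, 3.0, np.nan],
    "city": ["a", "b", "a", "c"],
})
y = np.array([1.0, 2.0, 3.0, 4.0])
numeric_variables = ["age", "income"]   # stand-in for dq.numeric_variables
categorical_variables = ["city"]        # stand-in for dq.categorical_variables

# Serial part: imputation feeds directly into standardization.
num_pipe = Pipeline([
    ("numeric_imputation", SimpleImputer(strategy="median")),
    ("standardizer", StandardScaler()),
])

# Parallel part: numeric and categorical branches run side by side.
preprocessor = ColumnTransformer(
    [
        ("num_pipe", num_pipe, numeric_variables),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_variables),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # assumption: force dense output for inspection
)

# Inspect the preprocessed matrix: numeric columns come first.
Xt = preprocessor.fit_transform(X)
print(Xt.shape)
print(Xt[:, :2].mean(axis=0))  # numeric columns are centred after scaling

# Full model: preprocessing followed by the estimator.
model = make_pipeline(preprocessor, ElasticNet())
model.fit(X, y)
preds = model.predict(X)
```

The nested Pipeline is just another transformer from the ColumnTransformer's point of view, which is why it can sit in the transformers list next to the one-hot encoder.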

I suggest reading the following post on a similar topic:

(You will find some other links in it.)

python

scikit-learn