如何将管道中的预处理数据作为对象输出？

Question

在许多 sklearn 管道示例中，我看到人们使用另一个管道将预处理管道传输到某个线性回归模型。是否可以只从管道输出预处理数据，以便我可以将其输入到我的 flaml 基线代码中：

automl.fit(X_train=pp_training_data, y_train=pp_training_labels, **automl_settings)

这是我希望预处理管道代码的外观和行为方式（我知道这行不通）：

def diamond_preprocess(data_dir):
    data = pd.read_csv(data_dir)
    cleaned_data = data.drop(['id', 'depth_percent'], axis=1)  # Features I don't want

    x = cleaned_data.drop(['price'], axis=1)  # Train data
    y = cleaned_data['price']  # Label data

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = x_train.select_dtypes(include=['object']).columns

    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),  # Fill in missing data with median
        ('scaler', StandardScaler())  # Scale data
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Fill in missing data with 'missing'
        ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One hot encode categorical data
    ])

    preprocessor_pipeline = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    pp_training_data, pp_training_label = preprocessor_pipeline

    return pp_training_data, pp_training_label

Answer 1

您可以仅将流水线应用于特征矩阵，而无需在最后一步中使用估算器。有关示例，请参见下面的代码。

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# generate the data
data = pd.DataFrame({
    'y':  [1, 2, 3, 4, 5],
    'x1': [6, 7, 8, np.nan, np.nan],
    'x2': [9, 10, 11, np.nan, np.nan],
    'x3': ['a', 'b', 'c', np.nan, np.nan],
    'x4': [np.nan, np.nan, 'd', 'e', 'f']
})

# extract the features and target
x = data.drop(labels=['y'], axis=1)
y = data['y']

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# map the features to the corresponding types (numerical or categorical)
numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

# define the numerical features pipeline
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# define the categorical features pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# define the overall pipeline
preprocessor_pipeline = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# fit the pipeline to the training data
preprocessor_pipeline.fit(x_train)

# apply the pipeline to the training and test data
x_train_ = preprocessor_pipeline.transform(x_train)
x_test_ = preprocessor_pipeline.transform(x_test)

如何将管道中的预处理数据作为对象输出？

How do you output preprocessed data from a pipeline as objects?

python

scikit-learn