How do I turn preprocessed data from pipelines into dataframes?

Question

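I have a script that preprocesses my data. Everything works until I have to feed the preprocessed data into a fit function that takes pandas DataFrames and arrays. How can I turn this training data into a DataFrame to feed in? After calling pipeline.fit(), the object I have is a ColumnTransformer rather than a pandas df.

Code: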

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# generate the data
data = pd.DataFrame({
    'y':  [1, 2, 3, 4, 5],
    'x1': [6, 7, 8, np.nan, np.nan],
    'x2': [9, 10, 11, np.nan, np.nan],
    'x3': ['a', 'b', 'c', np.nan, np.nan],
    'x4': [np.nan, np.nan, 'd', 'e', 'f']
})

# extract the features and target
x = data.drop(labels=['y'], axis=1)
y = data['y']

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# map the features to the corresponding types (numerical or categorical)
numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

# define the numerical features pipeline
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# define the categorical features pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# define the overall pipeline
preprocessor_pipeline = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# fit the pipeline to the training data
preprocessor_pipeline.fit(x_train)

# apply the pipeline to the training and test data
x_train_ = preprocessor_pipeline.transform(x_train)
x_test_ = preprocessor_pipeline.transform(x_test)

Bonus: do I also need to preprocess my labels (y_train)?

Answer 1

To convert the pipeline output into a DataFrame, you only need:

x_train_df = pd.DataFrame(data=x_train_)
x_test_df = pd.DataFrame(data=x_test_)

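Since your labels y are already numeric, they need no further preprocessing in most cases. That said, it also depends on the ML model you plan to use in the next step.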
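
To illustrate the point about labels, here is a rough sketch. The numeric target matches the question and is left untouched. The string target `y_text` is purely hypothetical (nothing like it appears in the question) and is only there to show one common case where a model might need the labels encoded first, using scikit-learn's `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Numeric target, as in the question: it can be passed to the model as is.
y_numeric = pd.Series([1, 2, 3, 4, 5], name="y")

# Hypothetical string target (assumed for illustration only).
y_text = pd.Series(["low", "high", "low", "mid", "high"], name="y")

encoder = LabelEncoder()
# fit_transform learns the distinct classes and maps each label to an
# integer code.
y_encoded = encoder.fit_transform(y_text)

# inverse_transform maps the integer codes back to the original labels,
# which is useful when reporting predictions.
y_restored = encoder.inverse_transform(y_encoded)
```

Whether this step is needed at all depends on the estimator; many scikit-learn classifiers accept string labels directly.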

