How do I turn preprocessed data from pipelines into dataframes?
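I have this code that preprocesses my data. Everything works until I have to feed the preprocessed data into a fitting function that takes pandas dataframes and arrays. How can I turn this training data into a dataframe to feed in? From the pipeline.fit() call onward, the data type is a ColumnTransformer rather than a pandas df.
Code: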
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# generate the data
data = pd.DataFrame({
    'y': [1, 2, 3, 4, 5],
    'x1': [6, 7, 8, np.nan, np.nan],
    'x2': [9, 10, 11, np.nan, np.nan],
    'x3': ['a', 'b', 'c', np.nan, np.nan],
    'x4': [np.nan, np.nan, 'd', 'e', 'f']
})
# extract the features and target
x = data.drop(labels=['y'], axis=1)
y = data['y']
# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
# map the features to the corresponding types (numerical or categorical)
numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()
# define the numerical features pipeline
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# define the categorical features pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# define the overall pipeline
preprocessor_pipeline = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])
# fit the pipeline to the training data
preprocessor_pipeline.fit(x_train)
# apply the pipeline to the training and test data
x_train_ = preprocessor_pipeline.transform(x_train)
x_test_ = preprocessor_pipeline.transform(x_test)
Bonus: do I also need to preprocess my labels (y_train)?
To convert the pipeline output into a dataframe, you only need:
x_train_df = pd.DataFrame(data=x_train_)
x_test_df = pd.DataFrame(data=x_test_)
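Since your labels y are already numeric, they usually need no further preprocessing. This does, however, depend on the ML model you plan to use in the next step.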
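As an example of a case where labels would need preprocessing: if y held class names as strings instead of numbers, one common option is scikit-learn's LabelEncoder. The sketch below uses made-up labels (y_train_text is hypothetical and not part of the question's data) and is only one way to do it.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string class labels, used only for illustration.
y_train_text = ["cat", "dog", "cat", "bird"]

encoder = LabelEncoder()
# Classes are sorted alphabetically (bird, cat, dog) and mapped to 0, 1, 2.
y_train_encoded = encoder.fit_transform(y_train_text)

# inverse_transform maps the integer codes back to the original labels.
y_train_decoded = encoder.inverse_transform(y_train_encoded)

print(list(y_train_encoded))
print(list(y_train_decoded))
```

For a numeric regression target like the y in the question, this step is not needed.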