How can I create a pipeline with encoding for different categorical columns?
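I'm having trouble implementing a pipeline in which I want to use OrdinalEncoder and OneHotEncoder on different categorical columns.
At the moment my code looks like this: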
X = stroke_df.drop(columns=['id', 'smoking_status', 'stroke'])
y = stroke_df['stroke'].copy()
num_columns = X.select_dtypes(np.number).columns.tolist()
cat_columns = X.select_dtypes('object').columns.tolist()
all_columns = num_columns + cat_columns # this order will need to be preserved
print('Numerical columns:', ', '.join(num_columns))
print('Categorical columns:', ', '.join(cat_columns))
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler())
])
cat_pipeline = ColumnTransformer([
    ('label_encoder', LabelEncoder(), ['ever_married', 'work_type']),
    ('one_hot_encoder', OneHotEncoder(), ['gender', 'residence_type'])
])
pipeline = ColumnTransformer([
    ('num', num_pipeline, num_columns),
    ('cat', cat_pipeline, cat_columns)
])
However, when I call fit_transform on the pipeline to preprocess the input feature matrix, I get a TypeError:
X_prep = pipeline.fit_transform(X)
TypeError: fit_transform() takes 2 positional arguments but 3 were given
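Your error is caused by using LabelEncoder in the pipeline. The documentation states that it should only be used to encode the y variable (the target). Its fit_transform accepts only y, so when ColumnTransformer calls fit_transform(X, y), the extra argument produces the TypeError. If your variables are truly ordinal, use OrdinalEncoder instead; otherwise use one-hot encoding. The code below also uses a simpler pipeline.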
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
# Set-up
df = pd.DataFrame({'gender': np.random.choice(['M', 'F'], size=5),
                   'ever_married': np.random.choice(['Y', 'N'], size=5),
                   'residence_type': list('ABCDE'),
                   'work_type': list('abcde'),
                   'num_col': np.array([1, 2, np.nan, 3, 4])})
ord_cols = ['ever_married', 'work_type']
ohe_cols = ['gender', 'residence_type']
num_cols = ['num_col']
# Preprocessing pipeline
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler())
])
pipeline = ColumnTransformer(
    [
        ('num_imputer', num_pipeline, num_cols),
        ('ord_encoder', OrdinalEncoder(), ord_cols),
        ('ohe_encoder', OneHotEncoder(), ohe_cols)
    ]
)
# Preprocessing
X_prep = pipeline.fit_transform(df)
Output:
df
  gender ever_married residence_type work_type  num_col
0      M            Y              A         a      1.0
1      F            Y              B         b      2.0
2      F            Y              C         c      NaN
3      M            Y              D         d      3.0
4      M            N              E         e      4.0
X_prep
array([[-1.5,  1. ,  0. ,  0. ,  1. ,  1. ,  0. ,  0. ,  0. ,  0. ],
       [-0.5,  1. ,  1. ,  1. ,  0. ,  0. ,  1. ,  0. ,  0. ,  0. ],
       [ 0. ,  1. ,  2. ,  1. ,  0. ,  0. ,  0. ,  1. ,  0. ,  0. ],
       [ 0.5,  1. ,  3. ,  0. ,  1. ,  0. ,  0. ,  0. ,  1. ,  0. ],
       [ 1.5,  0. ,  4. ,  0. ,  1. ,  0. ,  0. ,  0. ,  0. ,  1. ]])
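To see where the message in the question comes from, here is a minimal sketch that reproduces it outside of any pipeline. The toy arrays X and y are made up for illustration, and the explicit category order ['N', 'Y'] is only an assumption about how one might want the column ranked.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data: one categorical feature column and a target
X = np.array([['Y'], ['N'], ['Y']])
y = np.array([1, 0, 1])

# ColumnTransformer and Pipeline call transformer.fit_transform(X, y).
# LabelEncoder.fit_transform only accepts y, so the second argument fails.
try:
    LabelEncoder().fit_transform(X, y)
    label_encoder_error = None
except TypeError as exc:
    label_encoder_error = exc
print(type(label_encoder_error).__name__)

# OrdinalEncoder has the feature-transformer signature fit_transform(X, y=None).
# Passing categories fixes the order explicitly instead of sorting the values.
encoder = OrdinalEncoder(categories=[['N', 'Y']])
encoded = encoder.fit_transform(X, y)
print(encoded.ravel())
```

The same mismatch is what happens inside the ColumnTransformer: the encoder receives both X and y, and only a transformer written for feature matrices accepts that call.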
I ran into a similar problem, and the point is... I just needed to give my transformers different names... that's it.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat_ordinal', categorical_transformer_OE, ordinal_cols),
        ('cat', categorical_transformer_OH, OH_cols)
    ])
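I thought I couldn't change names like 'num' and 'cat'. I feel like such an idiot, haha.
(Maybe someone else has made a similarly silly mistake and this will help :))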
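The answer does not say which error it hit, so the following is only a sketch of one way a naming problem can show up, assuming two transformers were given the same name. The DataFrame, the column names and the helper build are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data with one ordinal-style and one nominal column
df = pd.DataFrame({'size': ['S', 'M', 'L'],
                   'color': ['red', 'blue', 'red']})

def build(names):
    """Build a ColumnTransformer whose two steps use the given names."""
    return ColumnTransformer([
        (names[0], OrdinalEncoder(), ['size']),
        (names[1], OneHotEncoder(), ['color']),
    ])

# Reusing one name for both steps is rejected when the transformer is fitted.
try:
    build(['cat', 'cat']).fit_transform(df)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc
print(duplicate_error)

# Any distinct names work; 'num' and 'cat' are labels, not keywords.
X_prep = build(['cat_ordinal', 'cat_onehot']).fit_transform(df)
print(X_prep.shape)
```

The names are free-form labels chosen by the user, which is why renaming one of the steps is enough.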