Pipeline with Custom Transformer Class does not work within a full Pipeline using FeatureUnion
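I am preparing data from the German credit dataset (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). I built a custom transformer to extract features from one of the dataset's attributes, and it works only when it runs in a small pipeline by itself.
The custom transformer (AddGenderStatus) adds gender and status as features.
The problem appears when I put this pipeline into a full pipeline using FeatureUnion:
KeyError: "['gender', 'status'] not in index"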
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import train_test_split
# %% set column names
attributes=['checking_balance',
'months_loan_duration',
'credit_history',
'purpose',
'amount',
'savings_balance',
'employment_duration',
'installment_rate_income',
'status_gender',
'debtors_guarantors',
'residence_years',
'property',
'age',
'other_installment',
'housing',
'existing_loans_count',
'job',
'dependents',
'phone',
'class']
# %% load the data
# https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
url ='https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'
credit = pd.read_csv(url, sep=' ',header=None, names=attributes, index_col=False)
# %% Split the data
X=credit.drop('class', axis=1)
y=credit['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# check class balance
y.value_counts()/len(y)
y_train.value_counts()/len(y_train)
y_test.value_counts()/len(y_test)
# %% Class to extract gender and status features
""" Attribute 9: (qualitative)
Personal status and sex - status_sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : (female : single - does not exist)
"""
class AddGenderStatus(TransformerMixin, BaseEstimator):
    def __init__(self, key):
        # key is the column name as str
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        function_gender = lambda x: 'male' if x == 'A91' or x == 'A93' or x == 'A94' else 'female'
        function_status = lambda x: 'divorced' if x == 'A91' else ('married' if x == 'A92' or x == 'A94' else 'single')
        X_new = X.copy()
        X_new["status"] = X[self.key].map(function_status)
        X_new["gender"] = X[self.key].map(function_gender)
        X_new.drop([self.key], axis=1, inplace=True)
        return X_new
# %% Pipeline new_attribs
gender_status_attribs = Pipeline([
('AddGenderStatus',AddGenderStatus(key='status_gender'))
])
X_train_check = gender_status_attribs.transform(X_train)
'gender' in X_train_check and 'status' in X_train_check  # True
# %% Create a class to select numerical or categorical columns
class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        # key is a column name (str) or a list of column names
        self.key = key
    def fit(self, X, y=None):
        # stateless transformer
        return self
    def transform(self, X):
        # assumes X is a DataFrame
        return X[self.key]
# %% Encoding categorical data
cat_attribs = ['checking_balance',
'credit_history',
'purpose',
'savings_balance',
'employment_duration',
'debtors_guarantors',
'property',
'other_installment',
'housing',
'job',
'phone',
'gender',
'status']
# %% Pipeline categorical
categorical_attribs = Pipeline([
('selector', ColumnExtractor(key=cat_attribs)),
('encoder',OneHotEncoder(drop='first',sparse=False,))
])
# %% Full Pipeline
full_pipeline = FeatureUnion(transformer_list=[("gender_status_attribs", gender_status_attribs),
("categorical_attribs", categorical_attribs),
])
X_train_prepared=full_pipeline.transform(X_train)
# KeyError: "['gender', 'status'] not in index"
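I found two problems in your code:
First, instead of a FeatureUnion you need a Pipeline, because the second transformer expects the output of the first one as its input. A FeatureUnion hands the same original X to every transformer, so categorical_attribs never sees the 'gender' and 'status' columns. (If you really do want a feature union, you need a combination of FeatureUnion and Pipeline.)
Second, full_pipeline.transform should become full_pipeline.fit_transform, because the OneHotEncoder has to be fitted before it can transform anything.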
Once you change those lines to:
# %% Full Pipeline
full_pipeline = Pipeline([("gender_status_attribs", gender_status_attribs),
("categorical_attribs", categorical_attribs),
])
X_train_prepared=full_pipeline.fit_transform(X_train)
your code runs without errors.
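To see the difference between the two containers without downloading the dataset, here is a minimal sketch on a toy frame. The classes AddColumn and SelectColumns and the three-row DataFrame are hypothetical stand-ins I made up for AddGenderStatus, ColumnExtractor and X_train; they are not the code from the question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline


class AddColumn(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for AddGenderStatus: adds a derived 'gender' column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_new = X.copy()
        X_new["gender"] = X_new["status_gender"].map(
            lambda code: "female" if code == "A92" else "male"
        )
        return X_new


class SelectColumns(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for ColumnExtractor: selects columns by name."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Toy data (made up for illustration, not the real dataset).
toy = pd.DataFrame({"status_gender": ["A91", "A92", "A93"]})

# Pipeline: SelectColumns receives the OUTPUT of AddColumn, so 'gender' exists.
chained = Pipeline([
    ("add", AddColumn()),
    ("select", SelectColumns(key=["gender"])),
])
chained_result = chained.fit_transform(toy)

# FeatureUnion: every branch receives the ORIGINAL toy frame, so the
# selector cannot find 'gender' and pandas raises a KeyError.
union = FeatureUnion([
    ("add", AddColumn()),
    ("select", SelectColumns(key=["gender"])),
])
try:
    union.fit_transform(toy)
    union_error = None
except KeyError as exc:
    union_error = exc

print(chained_result)
print(type(union_error).__name__)
```

The same two steps succeed when chained and fail when placed side by side, which is the KeyError from the question in miniature.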
EDIT
If you insist on using FeatureUnion, you could consider:
ppl = Pipeline([("gender_status", gender_status_attribs),("categorical_attribs", categorical_attribs)])
full_pipeline = FeatureUnion([("gender_status_attribs", gender_status_attribs),("pipeline",ppl)])
X_train_prepared=full_pipeline.fit_transform(X_train)
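The general pattern behind this is that each branch of the union must be able to start from the raw frame, so any branch that needs the derived columns has to be a Pipeline that creates them first. Below is a rough sketch of that pattern on toy data. AddColumn, SelectColumns, the branch names and the two-column DataFrame are hypothetical, and I use OneHotEncoder with its default sparse output only to keep the sketch independent of the scikit-learn version.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class AddColumn(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for AddGenderStatus: adds a derived 'gender' column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_new = X.copy()
        X_new["gender"] = X_new["status_gender"].map(
            lambda code: "female" if code == "A92" else "male"
        )
        return X_new


class SelectColumns(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for ColumnExtractor: selects columns by name."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Toy data (made up for illustration, not the real dataset).
toy = pd.DataFrame({
    "status_gender": ["A91", "A92", "A93"],
    "housing": ["A151", "A152", "A152"],
})

# Branch 1 derives 'gender' itself before selecting and encoding it.
derived_branch = Pipeline([
    ("add", AddColumn()),
    ("select", SelectColumns(key=["gender"])),
    ("encoder", OneHotEncoder()),
])

# Branch 2 only needs a column that already exists in the raw frame.
raw_branch = Pipeline([
    ("select", SelectColumns(key=["housing"])),
    ("encoder", OneHotEncoder()),
])

# Both branches get the raw frame; their outputs are stacked side by side.
union = FeatureUnion([
    ("derived", derived_branch),
    ("raw", raw_branch),
])
combined = union.fit_transform(toy)

print(combined.shape)
```

Here 'gender' and 'housing' each have two distinct values in the toy frame, so the union yields two one-hot columns per branch.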