Pipeline with Custom Transformer Class does not work within a full Pipeline using FeatureUnion

Question

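I am preparing data from the German credit dataset (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)). I built a custom transformer that extracts features from one of the dataset's attributes, but it only works in a small pipeline on its own. The custom transformer (AddGenderStatus) adds gender and status as features. The problem appears when I put this pipeline into a full pipeline using FeatureUnion:

KeyError: "['gender', 'status'] not in index"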

import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import train_test_split
# %% set column names
attributes=['checking_balance',
           'months_loan_duration',
           'credit_history',
           'purpose',
           'amount',
           'savings_balance',
           'employment_duration',
           'installment_rate_income',
           'status_gender',
           'debtors_guarantors',
           'residence_years',
           'property',
           'age',
           'other_installment',
           'housing',
           'existing_loans_count',
           'job',
           'dependents',
           'phone',
           'class']
# %% load the data
# https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
url ='https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data'
credit = pd.read_csv(url, sep=' ',header=None, names=attributes, index_col=False)
# %% Split the data
X=credit.drop('class', axis=1)
y=credit['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# check class balance
y.value_counts()/len(y)
y_train.value_counts()/len(y_train)
y_test.value_counts()/len(y_test)

# %% Class to extract gender and status features

"""       Attribute 9:  (qualitative)
          Personal status and sex  - status_sex
          A91 : male   : divorced/separated
          A92 : female : divorced/separated/married
          A93 : male   : single
          A94 : male   : married/widowed
          A95 : (female : single - does not exist) 
"""
class AddGenderStatus(TransformerMixin, BaseEstimator):
    def __init__(self, key):
        # key is the column name as str
        self.key = key
        
    def fit(self,X,y=None):
        return self
    
    def transform(self,X):
            function_gender = lambda x:'male'if x=='A91'or x=='A93'or x=='A94' else 'female'
            function_status = lambda x: 'divorced' if x=='A91' else ('married' if x=='A92' or x=='A94' else 'single')
            X_new = X.copy()
            X_new["status"] = X[self.key].map(function_status)
            X_new["gender"] = X[self.key].map(function_gender)
            X_new.drop([self.key], axis=1,inplace=True)
            return X_new

# %% Pipeline new_attribs
gender_status_attribs = Pipeline([
                        ('AddGenderStatus',AddGenderStatus(key='status_gender'))
                        ])
X_train_check = gender_status_attribs.transform(X_train)
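# both derived columns must be present ('gender' and 'status' in ... would only test 'status')
{'gender', 'status'}.issubset(X_train_check.columns)  # True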
# %% Create a class to select numerical or categorical columns 

class ColumnExtractor(BaseEstimator,TransformerMixin):
    def __init__(self, key):
        # key is the column name as str
        self.key = key
    def fit(self, X, y=None):
        # stateless transformer
        return self
    def transform(self, X):
        # assumes X is a DataFrame
        return X[self.key]
# %% Encoding categorical data
cat_attribs = ['checking_balance',
                 'credit_history',
                 'purpose',
                 'savings_balance',
                 'employment_duration',
                 'debtors_guarantors',
                 'property',
                 'other_installment',
                 'housing',
                 'job',
                 'phone',
                 'gender', 
                 'status']
# %% Pipeline categorical
categorical_attribs = Pipeline([
                                ('selector', ColumnExtractor(key=cat_attribs)),
                                ('encoder',OneHotEncoder(drop='first',sparse=False,))
                                ])
# %% Full Pipeline
full_pipeline = FeatureUnion(transformer_list=[("gender_status_attribs", gender_status_attribs),
                                               ("categorical_attribs", categorical_attribs),
                                               ])
X_train_prepared=full_pipeline.transform(X_train) 
# KeyError: "['gender', 'status'] not in index"

Answer 1

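I found 2 problems in your code:

1. Instead of a FeatureUnion you need a Pipeline, because the second transformer expects the output of the first as its input. A FeatureUnion hands the same original input to each of its transformers, so the categorical branch never sees the 'gender' and 'status' columns. (If you really do want a feature union, you need a combination of FeatureUnion and Pipeline.)
2. full_pipeline.transform should become full_pipeline.fit_transform, because the OneHotEncoder has to be fitted before it can transform.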
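
As a quick illustration of the second point, here is a minimal sketch on a made-up three-row frame (the data and the variable names are assumptions for illustration only, not taken from the question). It shows that an unfitted OneHotEncoder refuses to transform, while fit_transform learns the categories and encodes in one call:

```python
import pandas as pd
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data: a single categorical column.
X = pd.DataFrame({"housing": ["A151", "A152", "A152"]})

encoder = OneHotEncoder()

# Calling transform before fit raises NotFittedError,
# because the encoder has not learned any categories yet.
not_fitted = False
try:
    encoder.transform(X)
except NotFittedError:
    not_fitted = True
print("transform before fit failed:", not_fitted)

# fit_transform learns the categories and encodes in one step.
encoded = encoder.fit_transform(X)
print(encoded.shape)  # 3 rows, one column per distinct category
```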

If you change the final lines of your script to:

# %% Full Pipeline
full_pipeline = Pipeline([("gender_status_attribs", gender_status_attribs),
                          ("categorical_attribs", categorical_attribs),
                                               ])
X_train_prepared=full_pipeline.fit_transform(X_train)

your code will run without errors.
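
To see why the union fails while the chained pipeline works, here is a small self-contained sketch. The four-row frame is made up, the two classes are simplified stand-ins for the ones in the question, and the helper `make_encode` is a hypothetical name used only here. The union gives every branch the raw input, so the column selector raises a KeyError. The Pipeline passes the output of the first step on to the second:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class AddGenderStatus(TransformerMixin, BaseEstimator):
    """Simplified stand-in: derive 'status' and 'gender' from one coded column."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        gender = {"A91": "male", "A92": "female", "A93": "male", "A94": "male"}
        status = {"A91": "divorced", "A92": "married", "A93": "single", "A94": "married"}
        X_new = X.drop(columns=[self.key])
        X_new["status"] = X[self.key].map(status)
        X_new["gender"] = X[self.key].map(gender)
        return X_new


class ColumnExtractor(TransformerMixin, BaseEstimator):
    """Select the columns named in `key` from a DataFrame."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Made-up toy data with one coded column and one ordinary categorical column.
X = pd.DataFrame({
    "status_gender": ["A91", "A92", "A93", "A94"],
    "housing": ["A151", "A152", "A152", "A153"],
})
cat_cols = ["housing", "gender", "status"]


def make_encode():
    # Hypothetical helper: a fresh select-then-encode pipeline.
    return Pipeline([
        ("selector", ColumnExtractor(key=cat_cols)),
        ("encoder", OneHotEncoder(drop="first")),
    ])


# Parallel: each branch receives the raw X, which has no 'gender'/'status'.
union = FeatureUnion([
    ("add", AddGenderStatus(key="status_gender")),
    ("encode", make_encode()),
])
union_error = None
try:
    union.fit_transform(X)
except KeyError as exc:
    union_error = exc
print("FeatureUnion raised:", type(union_error).__name__)

# Sequential: the encoder branch receives the output of AddGenderStatus.
chained = Pipeline([
    ("add", AddGenderStatus(key="status_gender")),
    ("encode", make_encode()),
])
encoded = chained.fit_transform(X)
print("Pipeline output shape:", encoded.shape)
```

With drop='first', each categorical column contributes one column fewer than its number of distinct values, so the toy frame ends up with 2 + 1 + 2 encoded columns.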

Edit

If you insist on using FeatureUnion, you could consider:

ppl = Pipeline([("gender_status", gender_status_attribs),("categorical_attribs", categorical_attribs)])
full_pipeline = FeatureUnion([("gender_status_attribs", gender_status_attribs),("pipeline",ppl)])
X_train_prepared=full_pipeline.fit_transform(X_train)
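
A common reason to want the union is to put numeric columns next to the encoded categorical ones. The following is only a sketch of that layout, under several assumptions: the toy data is made up, the column "amount" stands in for all numeric attributes, and FunctionTransformer with plain functions replaces the custom classes to keep the example short. The key point is that the step deriving 'gender' and 'status' lives inside the categorical branch, so that branch no longer depends on any other branch:

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Made-up toy data; "amount" stands in for the numeric attributes.
X = pd.DataFrame({
    "status_gender": ["A91", "A92", "A93", "A94"],
    "housing": ["A151", "A152", "A152", "A153"],
    "amount": [100, 200, 300, 400],
})
num_cols = ["amount"]
cat_cols = ["housing", "gender", "status"]


def add_gender_status(df):
    """Compact stand-in for AddGenderStatus.transform."""
    codes = df["status_gender"]
    out = df.drop(columns=["status_gender"])
    out["gender"] = codes.map(lambda c: "male" if c in ("A91", "A93", "A94") else "female")
    out["status"] = codes.map(
        {"A91": "divorced", "A92": "married", "A93": "single", "A94": "married"}
    )
    return out


def select_numeric(df):
    return df[num_cols]


def select_categorical(df):
    return df[cat_cols]


# `sparse` was renamed to `sparse_output` in scikit-learn 1.2;
# use whichever name the installed version provides.
params = OneHotEncoder().get_params()
dense_flag = "sparse_output" if "sparse_output" in params else "sparse"

# The categorical branch derives its own columns before selecting them.
categorical_branch = Pipeline([
    ("add", FunctionTransformer(add_gender_status)),
    ("select", FunctionTransformer(select_categorical)),
    ("encode", OneHotEncoder(drop="first", **{dense_flag: False})),
])
numeric_branch = FunctionTransformer(select_numeric)

full_pipeline = FeatureUnion([
    ("numeric", numeric_branch),
    ("categorical", categorical_branch),
])
X_prepared = full_pipeline.fit_transform(X)
print(X_prepared.shape)  # 1 numeric column + 5 one-hot columns
```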

python

pipeline

scikit-learn