Using multiple custom classes with Pipeline sklearn (Python)

Question

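I am trying to put together a tutorial on Pipeline for students, but I am stuck. I'm not an expert, but I'm trying to improve, so thank you for your patience. I am trying to run several steps in a pipeline to prepare a dataframe for a classifier:

Step 1: Describe the dataframe
Step 2: Fill NaN values
Step 3: Convert categorical values to numbers

Here is my code: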

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

class Descr_df(object):

    def transform (self, X):
        print ("Structure of the data: \n {}".format(X.head(5)))
        print ("Features names: \n {}".format(X.columns))
        print ("Target: \n {}".format(X.columns[0]))
        print ("Shape of the data: \n {}".format(X.shape))

    def fit(self, X, y=None):
        return self

class Fillna(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        for column in X.columns:
            if column in non_numerics_columns:
                # Fill with the most frequent value (use X, not an outer df)
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
            else:
                X[column] = X[column].fillna(X[column].mean())
        return X

    def fit(self, X,y=None):
        return self

class Categorical_to_numerical(object):

    def transform(self, X):
        non_numerics_columns = X.columns.difference(X._get_numeric_data().columns)
        le = LabelEncoder()
        for column in non_numerics_columns:
            X[column] = X[column].fillna(X[column].value_counts().idxmax())
            le.fit(X[column])
            X[column] = le.transform(X[column]).astype(int)
        return X

    def fit(self, X, y=None):
        return self

If I run Step 1 and Step 2, or Step 1 and Step 3, it works, but if I run Steps 1, 2, and 3 together, I get this error:

pipeline = Pipeline([('df_intropesction', Descr_df()), ('fillna',Fillna()), ('Categorical_to_numerical', Categorical_to_numerical())])
pipeline.fit(X, y)
AttributeError: 'NoneType' object has no attribute 'columns'

Answer 1

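This error occurs because, in a pipeline, the output of the first estimator is fed into the second, the output of the second is fed into the third, and so on.

From the documentation of Pipeline: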

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

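So for your pipeline, the execution steps are as follows:

Descr_df.fit(X) -> does nothing, returns self
newX = Descr_df.transform(X) -> should return a value, which is assigned to newX and passed to the next estimator, but your definition returns nothing (it only prints). So None is returned implicitly.
Fillna.fit(newX) -> does nothing, returns self
Fillna.transform(newX) -> calls newX.columns. But newX is None from step 2, hence the error.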
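
The chaining described in these steps can be reproduced without scikit-learn. The following is only a minimal sketch: run_steps is a hypothetical, simplified stand-in for what a pipeline does with its transformers (not scikit-learn's actual implementation), and PrintOnly and NeedsColumns are illustrative names that mimic the original Descr_df and Fillna.

```python
import pandas as pd


class PrintOnly:
    """Hypothetical stand-in for the original Descr_df: prints, returns nothing."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: {}".format(X.shape))
        # No return statement, so Python implicitly returns None.


class NeedsColumns:
    """Hypothetical stand-in for Fillna: its transform reads X.columns."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[list(X.columns)]


def run_steps(steps, X):
    """Simplified chaining: each step's output becomes the next step's input."""
    for step in steps:
        X = step.fit(X).transform(X)
    return X


df = pd.DataFrame({"a": [1.0, None], "b": ["x", None]})

try:
    run_steps([PrintOnly(), NeedsColumns()], df)
    error_message = None
except AttributeError as exc:
    # The second step received None instead of a dataframe.
    error_message = str(exc)

print(error_message)
```

The second step fails only because the first one handed it None, which matches the AttributeError reported in the question.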

Solution: change the transform method of Descr_df to return the dataframe unchanged:

def transform (self, X):
    print ("Structure of the data: \n {}".format(X.head(5)))
    print ("Features names: \n {}".format(X.columns))
    print ("Target: \n {}".format(X.columns[0]))
    print ("Shape of the data: \n {}".format(X.shape))
    return X

Suggestion: have your classes inherit from scikit-learn's BaseEstimator and TransformerMixin classes, in line with good practice.

That is, change class Descr_df(object) to class Descr_df(BaseEstimator, TransformerMixin), Fillna(object) to Fillna(BaseEstimator, TransformerMixin), and so on.
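
Putting the fix and the suggestion together, the three classes might look roughly like the sketch below. It is an illustration under some assumptions rather than a definitive version: the toy dataframe and its column names are made up, select_dtypes is used in place of the private _get_numeric_data, and each transform works on a copy so the caller's dataframe is not modified.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class Descr_df(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print("Shape of the data: \n {}".format(X.shape))
        return X  # pass the dataframe through to the next step


class Fillna(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        numeric_columns = X.select_dtypes(include="number").columns
        for column in X.columns:
            if column in numeric_columns:
                X[column] = X[column].fillna(X[column].mean())
            else:
                # Most frequent value for non-numeric columns
                X[column] = X[column].fillna(X[column].value_counts().idxmax())
        return X


class Categorical_to_numerical(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for column in X.select_dtypes(exclude="number").columns:
            X[column] = LabelEncoder().fit_transform(X[column]).astype(int)
        return X


# Toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "city": ["Paris", None, "Paris", "Lyon"],
})

pipeline = Pipeline([
    ("df_introspection", Descr_df()),
    ("fillna", Fillna()),
    ("categorical_to_numerical", Categorical_to_numerical()),
])
result = pipeline.fit_transform(toy)
print(result)
```

Because every transform now returns a dataframe, all three steps can run together in one pipeline.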

For more details on custom classes in a pipeline, see this example:

http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
