Using a pipeline classifier inside CalibratedClassifierCV

Question

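I am trying to train an XGBoost classifier. The target variable y is binary.

DATA (I could not find a sample dataset to make this fully reproducible, sorry).

X_train, X_validate, X_test (containing numerical and categorical data)

y_train, y_validate, y_test (binary values, 1/0).

Preprocessor.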

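# `selector` is assumed to be an alias for make_column_selector.
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical columns: fill missing values with a constant, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

# Numerical columns: mark missing values with the sentinel -999.
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=-999))])

preprocessor = ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('cat', categorical_transformer, selector(dtype_include="object")),
        ('num', numerical_transformer, selector(dtype_exclude="object"))
    ])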

Model.

import xgboost as xgb

best_clf = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier',
                            xgb.XGBClassifier(
                                seed=42,
                                objective='binary:logistic',
                                missing=-999,
                                # optimal params
                                learning_rate=0.1))])

# X_validate_preprocessed is not defined in this post; presumably it is
# X_validate after being transformed by the preprocessor.
best_clf.fit(X_train, y_train,
             classifier__early_stopping_rounds=10,
             classifier__eval_metric='aucpr',
             classifier__eval_set=[(X_validate_preprocessed, y_validate)],
             classifier__verbose=True)

So far everything works and I have a fitted model. Now I want to calibrate it.

Calibration.

I tried:

from sklearn.calibration import CalibratedClassifierCV

best_clf_calib = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('calibrator', CalibratedClassifierCV(
        base_estimator=best_clf.named_steps.classifier,
        cv='prefit',
        method='isotonic'))])

best_clf_calib.fit(X_validate, y_validate)

But it gives me the following error:

TypeError: predict_proba() got an unexpected keyword argument 'X'

Question: how exactly should the base_estimator parameter of CalibratedClassifierCV be set? I tried setting

base_estimator = best_clf

but in that case the pipeline appears to run twice, according to the pipeline step diagram (image not included).

Answer 1

Thanks for the reply. I'm glad that downgrading your sklearn version worked for you. Posting the link here for future reference.

Answer 2

You don't necessarily need to downgrade sklearn.

I think the problem lies in XGBoost. It is explained here: https://github.com/dmlc/xgboost/pull/6555

XGBoost defines:

predict_proba(self, data, ...

instead of:

predict_proba(self, X, ...

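Since sklearn 0.24 calls clf.predict_proba(X=X), the keyword X does not match XGBoost's parameter name data, and the TypeError is raised.

Here is an idea for solving the problem without changing package versions: create a class that inherits from XGBClassifier, override predict_proba with the correct parameter name, and call super().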
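
The subclassing idea can be illustrated without XGBoost installed. In this minimal sketch, LegacyClassifier is a hypothetical stand-in for an estimator whose predict_proba names its first parameter data, and SklearnCompatibleClassifier is an illustrative name for the subclass; neither name comes from XGBoost or scikit-learn, and the dummy probabilities are only there to make the example runnable.

```python
class LegacyClassifier:
    """Hypothetical stand-in: predict_proba names its input 'data', not 'X'."""

    def predict_proba(self, data):
        # Dummy output: every row gets the probabilities [0.5, 0.5].
        return [[0.5, 0.5] for _ in data]


class SklearnCompatibleClassifier(LegacyClassifier):
    """Same model, but predict_proba accepts the keyword 'X'."""

    def predict_proba(self, X, **kwargs):
        # Forward positionally, so the parent's parameter name is irrelevant.
        return super().predict_proba(X, **kwargs)


rows = [[1.0, 2.0], [3.0, 4.0]]

# Calling the parent with X=... reproduces the kind of error from the question.
try:
    LegacyClassifier().predict_proba(X=rows)
    legacy_error = None
except TypeError as exc:
    legacy_error = str(exc)

# The subclass accepts the keyword that the caller uses.
fixed_output = SklearnCompatibleClassifier().predict_proba(X=rows)
```

In the setting of this question, the parent class would be xgb.XGBClassifier, and an instance of the subclass would replace it as the classifier step of the pipeline. Forwarding the argument positionally means the override does not depend on what the parent calls its first parameter.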

pipeline

machine-learning

python-3.x

scikit-learn

xgboost