sklearn pipelines with fit_transform or predict objects instead of fit objects

Question

This example on the sklearn website only uses, and only talks about, the .fit() or .fit_transform() methods in Pipelines.

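But how do I use the .predict or .transform methods in Pipelines? Suppose I have preprocessed my training data, searched for the best hyperparameters, and trained a LightGBM model. I now want to predict on new data. Instead of doing all of the aforementioned steps manually, I want to run them one after another, according to the definition: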

Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.

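However, on my validation (or test) data I only want to call the .transform methods, plus some more functions that take a pandas Series (or DataFrame, or NumPy array) and return a processed one, and then finally call the .predict method of my LightGBM model, which would use the hyperparameters I already have.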

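I currently have nothing to show, because I don't know how to properly include methods of classes (such as StandardScaler_instance.transform()) and other methods like it.

What should I do, or what am I missing?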

Answer 1

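You have to build your pipeline so that it includes the LightGBM model, and then train the pipeline on your (pre-processed) training data.

In code, it could look like this: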

import lightgbm
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create some train and test data
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Define pipeline with scaler and lightgbm model
pipe = Pipeline([('scaler', StandardScaler()), ('lightgbm', lightgbm.LGBMClassifier())])

# Train pipeline
pipe.fit(X_train, y_train)

# Make predictions with pipeline (with lightgbm)
print("Predictions:", pipe.predict(X_test))

# Evaluate pipeline performance
print("Performance score:", pipe.score(X_test, y_test))

Output:

Predictions: [1 0 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 1 1 1 0 1 0 0]
Performance score: 0.84

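So, to answer your question:

But how do I use the .predict or .transform methods in Pipelines?

You don't have to call .transform yourself, because the pipeline automatically transforms the input data with the transformers you provided. That is why the documentation says: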

Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods.

You can call .predict on your test data, as shown in the code example.
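To make the point above concrete, here is a minimal sketch showing that calling .predict on a fitted pipeline gives the same result as calling .transform on the fitted scaler and then .predict on the fitted model by hand. To keep the sketch free of extra dependencies, LogisticRegression is used as a stand-in for the LightGBM model (that substitution, and the step name "model", are assumptions of this sketch, not part of the example above).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same toy data as in the example above
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LogisticRegression is only a stand-in for lightgbm.LGBMClassifier here
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# The manual route: transform with the already fitted scaler (no refitting),
# then predict with the already fitted model
scaler = pipe.named_steps["scaler"]
model = pipe.named_steps["model"]
manual_predictions = model.predict(scaler.transform(X_test))

# The pipeline route: one call does both steps
pipeline_predictions = pipe.predict(X_test)

same = np.array_equal(manual_predictions, pipeline_predictions)
print("Same predictions:", same)
```

The scaler is never refitted on the test data in either route; only the statistics learned from X_train are applied.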

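Instead of the StandardScaler I used in this example, you can give the pipeline a custom transformer. It must implement the .fit() and .transform() methods so that the pipeline can call them, and the transformer's output must match the input that the LightGBM model expects.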
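As an illustration of such a custom transformer, the sketch below defines a hypothetical ClipOutliers class (the name, its parameters, and the clipping logic are made up for this example) that learns per-column percentiles in .fit() and applies them in .transform(). LogisticRegression again stands in for the LightGBM model.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ClipOutliers(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clip each column to percentiles learned in fit."""

    def __init__(self, lower=1.0, upper=99.0):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the clipping bounds from the training data only
        X = np.asarray(X, dtype=float)
        self.low_ = np.percentile(X, self.lower, axis=0)
        self.high_ = np.percentile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # Apply the learned bounds; accepts a NumPy array or a DataFrame
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)


X, y = make_classification(random_state=0)

# LogisticRegression is only a stand-in for the LightGBM model
pipe = Pipeline([
    ("clip", ClipOutliers()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

predictions = pipe.predict(X)
clipped = pipe.named_steps["clip"].transform(X)
print("Clipped shape:", clipped.shape)
```

For a stateless function that only maps an array to an array, sklearn.preprocessing.FunctionTransformer may be enough and saves writing a class.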

Update

You can then pass parameters to the individual steps of the pipeline, as described in the documentation:

**fit_params : dict of string -> object
    Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
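A minimal sketch of the s__p naming in practice, assuming a step named "model" and using LogisticRegression as a stand-in for the LightGBM model: sample_weight is forwarded to that step's fit method as model__sample_weight. The same prefix convention is used by set_params, which is one way to plug in hyperparameters you already have (the value C=0.5 and the weights are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(random_state=0)

# Arbitrary illustrative weights: class 1 samples count twice as much
sample_weight = np.where(y == 1, 2.0, 1.0)

# LogisticRegression is only a stand-in for the LightGBM model
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])

# Hyperparameters: <step name>__<parameter name>, set before fitting
pipe.set_params(model__C=0.5)

# Fit parameters: the same prefix routes sample_weight to the "model" step
pipe.fit(X, y, model__sample_weight=sample_weight)

predictions = pipe.predict(X)
print("Number of predictions:", len(predictions))
```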

