PyTorch training loop within a sklearn pipeline

I'm currently experimenting with using PyTorch inside a pipeline, where all of the preprocessing will be handled.

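I was able to get it working. However, the results look a bit off: as the training loop progresses, the loss does not seem to decrease and gets stuck (presumably in a local optimum?).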

I follow the standard PyTorch training loop and wrap it in a fit method, since that is what sklearn expects:

import torch
from sklearn.base import BaseEstimator, TransformerMixin

import torch.nn.functional as F

from IPython.core.debugger import set_trace

import pandas as pd
import seaborn as sns
import numpy as np

from tqdm import tqdm
import random

df = sns.load_dataset("tips")
df.head()


class LinearRegressionModel(torch.nn.Module, BaseEstimator, TransformerMixin):
 
    def __init__(self, loss_func = torch.nn.MSELoss()):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(3, 1)  # Three inputs (intercept, tip, size), one output
        self.loss_func = loss_func
        self.optimizer = torch.optim.SGD(self.parameters(), lr = 0.01)
 
    def forward(self, x):
        y_pred = F.relu(self.linear(x))
        return y_pred    
    
    def fit(self, X, y):
        
#         set_trace()        

        X = torch.from_numpy(X.astype(np.float32))
        y = torch.from_numpy(y.values.astype(np.float32))
                
        for epoch in tqdm(range(0, 12)):
             
            pred_y = self.forward(X)

            # Compute and print loss
            loss = self.loss_func(pred_y, X)

            # Zero gradients, perform a backward pass,
            # and update the weights.
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            print('epoch {}, loss {}'.format(epoch, loss.item()))            


from sklearn.pipeline import Pipeline

from sklego.preprocessing import PatsyTransformer

my_model = LinearRegressionModel()

pipe = Pipeline([
    ("patsy", PatsyTransformer("tip + size")),
    ("model", my_model)
])

pipe.fit(df, df['total_bill'])

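It's not just that the model is too simple. If I use an sklearn linear regression estimated via stochastic gradient descent (SGDRegressor), the results look fine. I therefore conclude that the problem lies in my PyTorch class: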

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

pipe2 = Pipeline([
    ("patsy", PatsyTransformer("tip + C(size) + C(time)")),
    ("model", LinearRegression())
])

pipe2.fit(df, df['total_bill'])

mean_squared_error(df['total_bill'], pipe2.predict(df))



The problem in this implementation is in the fit method.

We are comparing the predictions with the design matrix:

# Compute and print loss
loss = self.loss_func(pred_y, X)

It should be the predictions and the true values y:

loss = self.loss_func(pred_y, y)
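
To see the fix in context, here is a minimal sketch of a corrected training loop. It is not the original class: `TorchLinearRegressor`, `n_epochs`, `losses_`, and the synthetic `X_demo` / `y_demo` data are hypothetical names introduced for illustration, and the sklearn base classes and the `tips` pipeline are left out so the snippet runs on its own. Two further changes are my own assumptions rather than part of the answer: the target is reshaped to `(n_samples, 1)` so that it matches the shape of `pred_y`, and the ReLU on the output is dropped.

```python
import numpy as np
import torch


class TorchLinearRegressor(torch.nn.Module):
    """Hypothetical stand-in for the question's LinearRegressionModel."""

    def __init__(self, n_features, lr=0.01, n_epochs=200):
        super().__init__()
        self.linear = torch.nn.Linear(n_features, 1)
        self.loss_func = torch.nn.MSELoss()
        self.optimizer = torch.optim.SGD(self.parameters(), lr=lr)
        self.n_epochs = n_epochs
        self.losses_ = []  # loss per epoch, kept for inspection

    def forward(self, x):
        # Plain linear output (assumption: no ReLU, unlike the question's code)
        return self.linear(x)

    def fit(self, X, y):
        X = torch.from_numpy(np.asarray(X, dtype=np.float32))
        # Reshape the target to (n_samples, 1) so it matches pred_y
        y = torch.from_numpy(np.asarray(y, dtype=np.float32)).reshape(-1, 1)

        for epoch in range(self.n_epochs):
            pred_y = self.forward(X)

            # The fix: compare predictions with the target y, not with X
            loss = self.loss_func(pred_y, y)

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
            self.losses_.append(loss.item())

        return self  # sklearn convention: fit returns the estimator


# Synthetic data (assumption): three features and a known linear target
torch.manual_seed(0)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)).astype(np.float32)
y_demo = X_demo @ np.array([1.5, -2.0, 0.5], dtype=np.float32) + 3.0

model = TorchLinearRegressor(n_features=3).fit(X_demo, y_demo)
print("first loss:", model.losses_[0], "last loss:", model.losses_[-1])
```

The reshape matters because `MSELoss` broadcasts its two arguments when their shapes differ: a `(n, 1)` prediction against a `(n,)` target becomes an `(n, n)` comparison, and PyTorch only emits a warning. That same broadcasting is presumably why the original `loss_func(pred_y, X)` call ran without raising an error.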