A question about validation process in PyTorch: val_loss lower than train_loss

Is it possible that, at some point during training, the validation loss of my deep learning model becomes lower than its training loss? I attach the code of the training process:

def train_model(model, train_loader, val_loader, lr):

    """Model training."""

    epochs = 100
    model.train()

    train_losses = []
    val_losses = []

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)

    # Reduce learning rate if no improvement is observed after 2 epochs.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=2, verbose=True)

    for epoch in range(epochs):

        for data in train_loader:
            y_pred = model(data)
            loss1 = criterion(y_pred[:, 0], data.y[0])
            loss2 = criterion(y_pred[:, 1], data.y[1])
            train_loss = 0.8*loss1 + 0.2*loss2

            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()

        train_losses.append(train_loss.detach().numpy())

        with torch.no_grad():

            for data in val_loader:
                y_val = model(data)
                loss1 = criterion(y_val[:, 0], data.y[0])
                loss2 = criterion(y_val[:, 1], data.y[1])
                val_loss = 0.8*loss1 + 0.2*loss2

            #scheduler.step(loss)

        val_losses.append(val_loss.detach().numpy())

        print(f'Epoch: {epoch}, train_loss: {train_losses[epoch]:.3f} , val_loss: {val_losses[epoch]:.3f}')

    return train_losses, val_losses

This is a multi-task model: I compute the two losses separately and then take a weighted sum of them.

I'm not sure whether the indentation around val_loss might cause some problems when printing. More generally, I have some confusion about the validation:

1) First, I pass all the batches in my train_loader and optimize the training loss.

2) Then, I iterate over my val_loader to make predictions on single batches of unseen data, but what I append to the val_losses list is the validation loss that the model computed on the last batch of val_loader. I'm not sure this is correct. Here are the train and val losses printed during training:

Epoch: 0, train_loss: 7.315 , val_loss: 7.027
Epoch: 1, train_loss: 7.227 , val_loss: 6.943
Epoch: 2, train_loss: 7.129 , val_loss: 6.847
Epoch: 3, train_loss: 7.021 , val_loss: 6.741
Epoch: 4, train_loss: 6.901 , val_loss: 6.624
Epoch: 5, train_loss: 6.769 , val_loss: 6.493
Epoch: 6, train_loss: 6.620 , val_loss: 6.347
Epoch: 7, train_loss: 6.452 , val_loss: 6.182
Epoch: 8, train_loss: 6.263 , val_loss: 5.996
Epoch: 9, train_loss: 6.051 , val_loss: 5.788
Epoch: 10, train_loss: 5.814 , val_loss: 5.555
Epoch: 11, train_loss: 5.552 , val_loss: 5.298
Epoch: 12, train_loss: 5.270 , val_loss: 5.022
Epoch: 13, train_loss: 4.972 , val_loss: 4.731
Epoch: 14, train_loss: 4.666 , val_loss: 4.431
Epoch: 15, train_loss: 4.357 , val_loss: 4.129
Epoch: 16, train_loss: 4.049 , val_loss: 3.828
Epoch: 17, train_loss: 3.752 , val_loss: 3.539
Epoch: 18, train_loss: 3.474 , val_loss: 3.269
Epoch: 19, train_loss: 3.220 , val_loss: 3.023
Epoch: 20, train_loss: 2.992 , val_loss: 2.803
Epoch: 21, train_loss: 2.793 , val_loss: 2.613
Epoch: 22, train_loss: 2.626 , val_loss: 2.453
Epoch: 23, train_loss: 2.488 , val_loss: 2.323
Epoch: 24, train_loss: 2.378 , val_loss: 2.220
Epoch: 25, train_loss: 2.290 , val_loss: 2.140
Epoch: 26, train_loss: 2.221 , val_loss: 2.078
Epoch: 27, train_loss: 2.166 , val_loss: 2.029
Epoch: 28, train_loss: 2.121 , val_loss: 1.991
Epoch: 29, train_loss: 2.084 , val_loss: 1.959
Epoch: 30, train_loss: 2.051 , val_loss: 1.932
Epoch: 31, train_loss: 2.022 , val_loss: 1.909
Epoch: 32, train_loss: 1.995 , val_loss: 1.887
Epoch: 33, train_loss: 1.970 , val_loss: 1.867
Epoch: 34, train_loss: 1.947 , val_loss: 1.849
Epoch: 35, train_loss: 1.924 , val_loss: 1.831
Epoch: 36, train_loss: 1.902 , val_loss: 1.815
Epoch: 37, train_loss: 1.880 , val_loss: 1.799
Epoch: 38, train_loss: 1.859 , val_loss: 1.783
Epoch: 39, train_loss: 1.839 , val_loss: 1.769
Epoch: 40, train_loss: 1.820 , val_loss: 1.755
Epoch: 41, train_loss: 1.800 , val_loss: 1.742
Epoch: 42, train_loss: 1.781 , val_loss: 1.730
Epoch: 43, train_loss: 1.763 , val_loss: 1.717
Epoch: 44, train_loss: 1.744 , val_loss: 1.705
Epoch: 45, train_loss: 1.726 , val_loss: 1.694
Epoch: 46, train_loss: 1.708 , val_loss: 1.683

...

So I suspect I messed up the indentation.

Validation loss can indeed be lower than training loss. Regularization such as the weight decay you use is applied only during training, and the training loss you record is computed while the weights are still being updated within the epoch, whereas the validation loss is measured after those updates.

As you note in point 2, you are only storing/appending the train and validation losses of the last batch. That is probably not what you want: you may want to store the loss at every iteration and take its average at the end of the epoch. This gives a better picture of training progress, because it is the loss over the whole dataset rather than a single batch.
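A minimal runnable sketch of that fix, using a toy linear model and random tensors as stand-ins for your multi-task data (the helper name run_epoch and the toy dataset are mine, not from your code):

```python
import torch
import torch.nn as nn

def run_epoch(model, loader, criterion, optimizer=None):
    """Return the loss averaged over every batch in the loader.

    If `optimizer` is given, run a training step per batch;
    otherwise evaluate under torch.no_grad().
    """
    is_train = optimizer is not None
    model.train(is_train)
    total, n_batches = 0.0, 0
    with torch.set_grad_enabled(is_train):
        for x, y in loader:
            y_pred = model(x)
            # Weighted sum of the two task losses, as in the question.
            loss = 0.8 * criterion(y_pred[:, 0], y[:, 0]) \
                 + 0.2 * criterion(y_pred[:, 1], y[:, 1])
            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item()
            n_batches += 1
    return total / n_batches  # epoch-level average, not the last batch

# Toy two-output regression data standing in for the real loaders.
torch.manual_seed(0)
X, Y = torch.randn(64, 4), torch.randn(64, 2)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, Y), batch_size=16)

model = nn.Linear(4, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-5)

train_losses, val_losses = [], []
for epoch in range(5):
    train_losses.append(run_epoch(model, loader, criterion, optimizer))
    # Reusing the same loader here only for the sketch; pass your val_loader.
    val_losses.append(run_epoch(model, loader, criterion))
    print(f'Epoch: {epoch}, train_loss: {train_losses[-1]:.3f}, '
          f'val_loss: {val_losses[-1]:.3f}')
```

Note that `model.train(is_train)` also switches the model between train and eval mode, which matters if your model contains dropout or batch-norm layers.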