A question about validation process in PyTorch: val_loss lower than train_loss
Is it possible that, at some point during a deep learning training run, my validation loss becomes lower than my training loss? I attach the code of the training process:
import torch
import torch.nn as nn

def train_model(model, train_loader, val_loader, lr):
    "Model training"
    epochs = 100
    model.train()
    train_losses = []
    val_losses = []
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
    # Reduce learning rate if no improvement is observed after 10 Epochs.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=2, verbose=True)
    for epoch in range(epochs):
        for data in train_loader:
            y_pred = model.forward(data)
            loss1 = criterion(y_pred[:, 0], data.y[0])
            loss2 = criterion(y_pred[:, 1], data.y[1])
            train_loss = 0.8*loss1 + 0.2*loss2
            optimizer.zero_grad()
            train_loss.backward()
            optimizer.step()
        train_losses.append(train_loss.detach().numpy())
        with torch.no_grad():
            for data in val_loader:
                y_val = model.forward(data)
                loss1 = criterion(y_val[:, 0], data.y[0])
                loss2 = criterion(y_val[:, 1], data.y[1])
                val_loss = 0.8*loss1 + 0.2*loss2
                # scheduler.step(loss)
            val_losses.append(val_loss.detach().numpy())
        print(f'Epoch: {epoch}, train_loss: {train_losses[epoch]:.3f} , val_loss: {val_losses[epoch]:.3f}')
    return train_losses, val_losses
This is a multi-task model: I compute the two losses separately and then take their weighted sum.
I am not sure whether the indentation around val_loss might be causing some problem in the printout. More generally, I have some confusion about the validation step:
1) First, I pass all the batches from my train_loader and update the model on the training loss.
2) Then, I iterate over my val_loader to make predictions on single batches of unseen data, but what I append to the val_losses list is the validation loss computed by the model on the last batch of val_loader. I am not sure this is correct.
I attach the train and val losses printed during training:
Epoch: 0, train_loss: 7.315 , val_loss: 7.027
Epoch: 1, train_loss: 7.227 , val_loss: 6.943
Epoch: 2, train_loss: 7.129 , val_loss: 6.847
Epoch: 3, train_loss: 7.021 , val_loss: 6.741
Epoch: 4, train_loss: 6.901 , val_loss: 6.624
Epoch: 5, train_loss: 6.769 , val_loss: 6.493
Epoch: 6, train_loss: 6.620 , val_loss: 6.347
Epoch: 7, train_loss: 6.452 , val_loss: 6.182
Epoch: 8, train_loss: 6.263 , val_loss: 5.996
Epoch: 9, train_loss: 6.051 , val_loss: 5.788
Epoch: 10, train_loss: 5.814 , val_loss: 5.555
Epoch: 11, train_loss: 5.552 , val_loss: 5.298
Epoch: 12, train_loss: 5.270 , val_loss: 5.022
Epoch: 13, train_loss: 4.972 , val_loss: 4.731
Epoch: 14, train_loss: 4.666 , val_loss: 4.431
Epoch: 15, train_loss: 4.357 , val_loss: 4.129
Epoch: 16, train_loss: 4.049 , val_loss: 3.828
Epoch: 17, train_loss: 3.752 , val_loss: 3.539
Epoch: 18, train_loss: 3.474 , val_loss: 3.269
Epoch: 19, train_loss: 3.220 , val_loss: 3.023
Epoch: 20, train_loss: 2.992 , val_loss: 2.803
Epoch: 21, train_loss: 2.793 , val_loss: 2.613
Epoch: 22, train_loss: 2.626 , val_loss: 2.453
Epoch: 23, train_loss: 2.488 , val_loss: 2.323
Epoch: 24, train_loss: 2.378 , val_loss: 2.220
Epoch: 25, train_loss: 2.290 , val_loss: 2.140
Epoch: 26, train_loss: 2.221 , val_loss: 2.078
Epoch: 27, train_loss: 2.166 , val_loss: 2.029
Epoch: 28, train_loss: 2.121 , val_loss: 1.991
Epoch: 29, train_loss: 2.084 , val_loss: 1.959
Epoch: 30, train_loss: 2.051 , val_loss: 1.932
Epoch: 31, train_loss: 2.022 , val_loss: 1.909
Epoch: 32, train_loss: 1.995 , val_loss: 1.887
Epoch: 33, train_loss: 1.970 , val_loss: 1.867
Epoch: 34, train_loss: 1.947 , val_loss: 1.849
Epoch: 35, train_loss: 1.924 , val_loss: 1.831
Epoch: 36, train_loss: 1.902 , val_loss: 1.815
Epoch: 37, train_loss: 1.880 , val_loss: 1.799
Epoch: 38, train_loss: 1.859 , val_loss: 1.783
Epoch: 39, train_loss: 1.839 , val_loss: 1.769
Epoch: 40, train_loss: 1.820 , val_loss: 1.755
Epoch: 41, train_loss: 1.800 , val_loss: 1.742
Epoch: 42, train_loss: 1.781 , val_loss: 1.730
Epoch: 43, train_loss: 1.763 , val_loss: 1.717
Epoch: 44, train_loss: 1.744 , val_loss: 1.705
Epoch: 45, train_loss: 1.726 , val_loss: 1.694
Epoch: 46, train_loss: 1.708 , val_loss: 1.683
...
So I suspect I messed up the indentation..
Validation loss can indeed be lower than training loss.
As you mention in point 2, you are only storing/appending the training and validation loss of the last batch. That is probably not what you want: you likely want to store the training loss at every iteration and look at its average at the end of the epoch. This will give you a better picture of the training progress, since it reflects the loss over the whole dataset rather than a single batch.
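Following that suggestion, here is a minimal sketch of the bookkeeping: accumulate the per-batch losses and append their epoch average instead of only the last batch's value. The toy linear model and the in-memory lists of random `(x, y)` batches are stand-ins for your real model and loaders, and the 0.8/0.2 weighting mirrors your multi-task sum:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical loaders: lists of (inputs, targets) batches
train_loader = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(5)]
val_loader = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]

train_losses, val_losses = [], []
for epoch in range(2):
    model.train()
    batch_losses = []
    for x, y in train_loader:
        pred = model(x)
        loss = 0.8 * criterion(pred[:, 0], y[:, 0]) + 0.2 * criterion(pred[:, 1], y[:, 1])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batch_losses.append(loss.item())
    # Average over all batches, not just the last one
    train_losses.append(sum(batch_losses) / len(batch_losses))

    model.eval()
    with torch.no_grad():
        batch_losses = []
        for x, y in val_loader:
            pred = model(x)
            loss = 0.8 * criterion(pred[:, 0], y[:, 0]) + 0.2 * criterion(pred[:, 1], y[:, 1])
            batch_losses.append(loss.item())
        val_losses.append(sum(batch_losses) / len(batch_losses))

print(len(train_losses), len(val_losses))  # one averaged entry per epoch
```

Note that this sketch also calls model.eval() before the validation pass; that makes no difference for a plain linear layer, but it matters if your model contains dropout or batch-norm layers, which behave differently in train and eval mode.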