Why do we need to call zero_grad() in PyTorch?

Question

Why does zero_grad() need to be called during training?

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

Answer 1

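In PyTorch, for every mini-batch during the training phase, we typically want to explicitly set the gradients to zero before starting backpropagation (i.e., before computing the gradients that are used to update the weights and biases), because PyTorch accumulates the gradients on subsequent backward passes. This accumulating behaviour is convenient while training RNNs or when we want to compute the gradient of the loss summed over multiple mini-batches. So, the default action has been set to accumulate (i.e. sum) the gradients on every loss.backward() call.

Because of this, when you start your training loop, ideally you should zero out the gradients so that the parameter update is done correctly. Otherwise, the gradient would be a combination of the old gradient, which you have already used to update your model parameters, and the newly computed gradient. It would therefore point in some direction other than the intended direction towards the minimum (or maximum, in the case of maximization objectives).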
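To see the accumulation in isolation, here is a minimal sketch with a single scalar parameter. The variable names (w, first, accumulated, after_zeroing) are made up for illustration and are not part of the answer above.

```python
import torch

# A single parameter; the loss w ** 2 has gradient 2 * w = 6 at w = 3.
w = torch.tensor(3.0, requires_grad=True)

(w ** 2).backward()
first = w.grad.clone()          # gradient after one backward pass

(w ** 2).backward()             # no zeroing in between
accumulated = w.grad.clone()    # old gradient + new gradient

w.grad.zero_()                  # clear the stored gradient by hand
(w ** 2).backward()
after_zeroing = w.grad.clone()  # only the new gradient

print(first.item(), accumulated.item(), after_zeroing.item())
```

The second value is twice the first because the two backward passes were summed into w.grad; after zeroing, the gradient is back to that of a single pass.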

Here is a simple example of a training loop:

import torch
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

# Variable is deprecated; tensors take requires_grad directly
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all tensors
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    # backward() needs a scalar loss, so sum the squared errors
    loss = ((output - target) ** 2).sum()
    loss.backward()
    optimizer.step()

Or, if you are doing vanilla gradient descent, then:

learning_rate = 0.01  # illustrative value; not specified in the original

W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

for sample, target in zip(data, targets):
    output = linear_model(sample, W, b)
    loss = ((output - target) ** 2).sum()
    loss.backward()

    # update the leaf tensors in place without tracking the update in autograd
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad

    # clear out the gradients of W and b after the update;
    # .grad is None before the first backward(), so zeroing happens here
    W.grad.zero_()
    b.grad.zero_()

Note:

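The accumulation (i.e., sum) of gradients happens when .backward() is called on the loss tensor.
As of v1.7.0, PyTorch offers the option to reset the gradients to None with optimizer.zero_grad(set_to_none=True) instead of filling them with a tensor of zeros. The docs claim that this setting reduces memory requirements and slightly improves performance, but it might be error-prone if not handled carefully.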
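The difference between the two modes can be sketched as follows. This is a small illustration only; the tensor w, the choice of SGD and the variable names are assumptions made for the example. The set_to_none argument is passed explicitly in both calls so the result does not depend on the default of the installed version.

```python
import torch

w = torch.randn(3, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

w.sum().backward()
optimizer.zero_grad(set_to_none=False)
grad_after_zero_fill = w.grad.clone()   # a tensor of zeros

w.sum().backward()
optimizer.zero_grad(set_to_none=True)
grad_after_set_to_none = w.grad         # None, no tensor is kept around

# Code that reads .grad directly has to handle the None case.
grad_norm = 0.0 if w.grad is None else w.grad.norm().item()
print(grad_after_zero_fill, grad_after_set_to_none, grad_norm)
```

The None check on the last lines is the kind of handling the "error-prone" warning refers to: anything that assumes .grad is always a tensor will fail after set_to_none=True.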

Answer 2

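If you use a gradient method to decrease the error (or loss), zero_grad() lets each loop iteration start fresh, without the gradients left over from the previous step.

If you do not use zero_grad(), the loss will increase instead of decreasing as required.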

For example:

If you use zero_grad(), you will get output like the following:

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

If you do not use zero_grad(), you will get output like the following:

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5
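
The numbers above are illustrative. A toy run that shows the same qualitative effect can be sketched as follows, assuming plain SGD on the single-parameter loss w ** 2. The function name train and its default arguments are made up for this sketch.

```python
import torch

def train(zero_each_step, steps=20, lr=0.1):
    """Minimize w ** 2 with plain SGD; return the loss after the final step."""
    w = torch.tensor(1.0, requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        if zero_each_step:
            optimizer.zero_grad()
        loss = w ** 2
        loss.backward()      # adds to w.grad if it was not cleared
        optimizer.step()
    return (w ** 2).item()

loss_with_zeroing = train(zero_each_step=True)
loss_without_zeroing = train(zero_each_step=False)
print(f"with zero_grad():    {loss_with_zeroing:.6f}")
print(f"without zero_grad(): {loss_without_zeroing:.6f}")
```

In this particular setup the run with zero_grad() ends with a loss close to zero, while the run without it keeps overshooting the minimum and does not settle. How badly things go wrong in general depends on the model, the optimizer and the learning rate.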

Answer 3

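Though the idea can be derived from the chosen answer, I feel like I want to write it out explicitly.

Being able to decide when to call optimizer.zero_grad() and optimizer.step() gives more freedom over how gradients are accumulated and applied by the optimizer in the training loop. This is crucial when the model or the input data is big and one actual training batch does not fit on the GPU card.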

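In this example from google-research, there are two arguments, named train_batch_size and gradient_accumulation_steps.

train_batch_size is the batch size for one forward pass, which is followed by loss.backward(). This is limited by the GPU memory.
gradient_accumulation_steps is the number of forward/backward passes whose gradients are accumulated before one optimizer step, so the actual training batch size is train_batch_size * gradient_accumulation_steps. This is NOT limited by the GPU memory.

From this example, you can see that optimizer.zero_grad() goes together with optimizer.step(), but NOT with loss.backward(). loss.backward() is invoked in every single iteration (line 216), but optimizer.zero_grad() and optimizer.step() are only invoked when the number of accumulated training batches equals gradient_accumulation_steps (line 227, placed inside the if block at line 219).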

https://github.com/google-research/xtreme/blob/master/third_party/run_classify.py
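
The pattern described above can be sketched in a few self-contained lines. This is not the code from the linked file: the model, the random data and the sizes are assumptions chosen so the check at the end is easy to follow. Only the names train_batch_size and gradient_accumulation_steps are taken from the example.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(8, 4)     # one "effective" batch of 8 samples
targets = torch.randn(8, 1)

# Reference: gradient of the mean loss over the full batch in one pass.
optimizer.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 2 samples each.
train_batch_size = 2
gradient_accumulation_steps = 4
optimizer.zero_grad()
for step in range(gradient_accumulation_steps):
    start = step * train_batch_size
    x = inputs[start:start + train_batch_size]
    y = targets[start:start + train_batch_size]
    # Scale so the summed gradients match the mean over the full batch.
    loss = loss_fn(model(x), y) / gradient_accumulation_steps
    loss.backward()            # called every iteration; gradients add up
accumulated_grad = model.weight.grad.clone()

optimizer.step()               # one update per effective batch
optimizer.zero_grad()          # clear before the next effective batch

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Only one micro-batch has to fit in memory at a time, yet the gradient used for the update matches the one computed from the full batch in a single pass.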

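Also, someone was asking about an equivalent method in TensorFlow. I guess tf.GradientTape serves the same purpose.

(I am new to AI libraries, so please correct me if anything I said is wrong.)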

Answer 4

You don't have to call zero_grad(); alternatively, you can decay the gradients, for example:

optimizer = some_pytorch_optimizer
# decay the grads :
for group in optimizer.param_groups:
    for p in group['params']:
        if p.grad is not None:
            ''' original code from git:
            if set_to_none:
                p.grad = None
            else:
                if p.grad.grad_fn is not None:
                    p.grad.detach_()
                else:
                    p.grad.requires_grad_(False)
                p.grad.zero_()
                
            '''
            p.grad = p.grad / 2

This way the learning is much more continuous.
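
To make the effect of the halving concrete, here is a small sketch under simplifying assumptions: a loss whose gradient is always 1, and a list named history introduced only to record what .grad holds after each backward pass.

```python
import torch

w = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

history = []
for _ in range(3):
    w.sum().backward()                 # contributes a gradient of 1 each time
    history.append(w.grad.item())
    # optimizer.step() would go here in a real training loop
    for group in optimizer.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad = p.grad / 2    # decay instead of zeroing

print(history)
```

Each stored gradient is the new gradient plus half of what was there before, so older gradients keep contributing with geometrically shrinking weight. That is similar in spirit to momentum; whether it helps on a given problem is not something this sketch shows.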

Answer 5

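During the forward pass, the weights are applied to the inputs, and after the first iteration the weights reflect what the model has learned from the samples (inputs). When we run backpropagation, we want to update the weights so as to minimize the loss of our cost function. So we clear the gradients left over from the previous step (not the weights themselves), so that the new update is based only on the current batch. We keep doing this during training, but not during testing, because at test time we only use the weights obtained during training, which are the best fit for our data. Hope this clears things up!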
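
A quick check of what zero_grad() touches, as a sketch with an arbitrary small model (the model, the random input and the variable names are assumptions for illustration):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One backward pass so that the parameters have gradients.
model(torch.randn(5, 2)).sum().backward()
weights_before = model.weight.detach().clone()

optimizer.zero_grad(set_to_none=False)

weights_unchanged = torch.equal(model.weight.detach(), weights_before)
grads_cleared = bool((model.weight.grad == 0).all())
print(weights_unchanged, grads_cleared)
```

The weights are left as they were and only the stored gradients are reset. At test time no backward pass is run, so there are no gradients to clear.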
