What is tape-based autograd in PyTorch?

I understand that autograd stands for automatic differentiation. But what exactly is the tape-based autograd in PyTorch, and why are there so many discussions that either affirm or deny it?

For example:

this

In pytorch, there is no traditional sense of tape

this

We don’t really build gradient tapes per se. But graphs.

But not this

Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation.

For further reference, please compare it with GradientTape in TensorFlow.

I suspect this is because the word 'tape' is used in two different ways in the context of automatic differentiation.

When people say it is not tape-based, they mean it uses operator overloading rather than [tape-based] source transformation for automatic differentiation.

[Operator overloading] relies on a language’s ability to redefine the meaning of functions and operators. All primitives are overloaded so that they additionally perform a tracing operation: The primitive is logged onto a ‘tape’, along with its inputs to ensure that those intermediate variables are kept alive. At the end of the function’s execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. [...]
OO is the technique used by PyTorch, Autograd, and Chainer [37].

...

Tape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global stack also called a ‘tape’² to ensure that intermediate variables are kept alive. The original (primal) function is augmented so that it writes intermediate variables to the tape during the forward pass, and the adjoint program will read intermediate variables from the tape during the backward pass. More recently, tape-based ST was implemented for Python in the ML framework Tangent [38].

...

² The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.
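The operator-overloading flavor of tape described in the quote can be sketched in a few lines of plain Python. This is a toy illustration of the general technique, not how PyTorch is actually implemented: overloaded primitives log themselves onto a global 'tape', and derivatives are computed by walking that tape in reverse.

```python
# Toy operator-overloading tape: every primitive logs itself and its
# inputs onto a global list; backward() walks the list in reverse.

tape = []  # linear trace of the executed primitives

class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def __mul__(self, other):
        out = Var(self.value * other.value)
        tape.append(('mul', self, other, out))  # log primitive + inputs
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)
        tape.append(('add', self, other, out))
        return out

def backward(result):
    result.grad = 1.0
    for op, a, b, out in reversed(tape):  # walk the tape in reverse
        if op == 'mul':
            a.grad += out.grad * b.value
            b.grad += out.grad * a.value
        else:  # 'add'
            a.grad += out.grad
            b.grad += out.grad

x, y = Var(3.0), Var(4.0)
z = x * y + x          # forward pass fills the tape
backward(z)            # reverse pass reads it back
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

Note how keeping references to the inputs on the tape is exactly what keeps the intermediate variables alive for the backward pass, as the quote points out.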

There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, and hybrids (more explanation). The tape-based autograd in PyTorch simply refers to its use of reverse-mode automatic differentiation (source). Reverse-mode auto-diff is just a technique for computing gradients efficiently, and it happens to be the one used by backpropagation.


Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system for automatic differentiation. In the forward phase, the autograd tape remembers all the operations it executed, and in the backward phase, it replays those operations.
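This record-and-replay behavior is visible on any tensor produced in the forward pass: its grad_fn attribute points at the last recorded operation, and next_functions links back through the recorded graph. A minimal illustration:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x + x  # forward: each operation is recorded as it runs

# The last recorded operation (an addition) and its links back
# into the recorded graph:
print(y.grad_fn)
print(y.grad_fn.next_functions)

y.backward()   # backward: replay the recorded operations in reverse
print(x.grad)  # dy/dx = 2*x + 1 = 7.0
```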

TensorFlow works similarly: to differentiate automatically, it also needs to remember what operations happened, and in what order, during the forward pass. Then, during the backward pass, it traverses this list of operations in reverse order to compute gradients. TensorFlow provides the tf.GradientTape API for automatic differentiation, that is, computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records the relevant operations executed inside the context of a tf.GradientTape onto a 'tape', and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.

So, from a high-level viewpoint, both are doing the same kind of operation. However, in a custom training loop, the forward pass and the loss calculation are more explicit in TensorFlow, since they happen inside the tf.GradientTape API scope, whereas in PyTorch these operations are implicit, but PyTorch needs to temporarily disable gradient tracking when updating the training parameters (weights and biases), which it does explicitly with the torch.no_grad API. In other words, TensorFlow's tf.GradientTape() is analogous to PyTorch's loss.backward(). Below are the above statements in simple code form.

# TensorFlow
import tensorflow as tf

# assumes tf_model, squared_error, x, y, epochs, learning_rate are defined
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
  with tf.GradientTape() as tape:
    # forward passing and loss calculations 
    # within explicit tape scope 
    predictions = tf_model(x)
    loss = squared_error(predictions, y)

  # compute gradients (grad)
  w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)

  # update training variables 
  w.assign(w - w_grad * learning_rate)
  b.assign(b - b_grad * learning_rate)


# PyTorch
import torch

# assumes torch_model, squared_error, inputs, labels, epochs, learning_rate are defined
[w, b] = torch_model.parameters()
for epoch in range(epochs):
  # forward pass and loss calculation 
  # implicit tape-based AD 
  y_pred = torch_model(inputs)
  loss = squared_error(y_pred, labels)

  # compute gradients (grad)
  loss.backward()
  
  # update training variables / parameters  
  with torch.no_grad():
    w -= w.grad * learning_rate
    b -= b.grad * learning_rate
    w.grad.zero_()
    b.grad.zero_()

FYI, in the above, the trainable variables (w and b) are updated manually in both frameworks, but we generally use an optimizer (e.g. Adam) to do the job.

# TensorFlow 
# ....
# update training variables 
optimizer.apply_gradients(zip([w_grad, b_grad], tf_model.trainable_variables))

# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
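The optimizer-based update above can be sketched end-to-end on the PyTorch side. This is a self-contained toy example (the model shape, data, and hyperparameters are made up for illustration):

```python
import torch

torch.manual_seed(0)

# toy linear-regression setup: learn labels = 2 * inputs + 1
torch_model = torch.nn.Linear(1, 1)
optimizer = torch.optim.SGD(torch_model.parameters(), lr=0.1)

inputs = torch.randn(8, 1)
labels = 2 * inputs + 1

for epoch in range(200):
    y_pred = torch_model(inputs)               # forward pass (implicitly recorded)
    loss = torch.mean((y_pred - labels) ** 2)  # squared error

    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()        # reverse-mode AD over the recorded graph
    optimizer.step()       # update the parameters using the gradients
```

Here optimizer.step() and optimizer.zero_grad() replace the manual parameter updates inside torch.no_grad() shown earlier.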