What is tape-based autograd in PyTorch?
I understand that autograd is used to denote automatic differentiation. But what exactly is tape-based autograd in PyTorch, and why are there so many discussions that either affirm or deny it?
For example:
In pytorch, there is no traditional sense of tape
and this
We don't really build gradient tapes per se. But graphs.
but not this
Autograd is now a core torch package for automatic differentiation. It uses a tape based system for automatic differentiation.
For further reference, please compare it with GradientTape in Tensorflow.
I suspect this is because the word 'tape' is used in two different senses in the context of automatic differentiation.
When people say that PyTorch is not tape-based, they mean that it uses operator overloading, as opposed to [tape-based] source transformation, for automatic differentiation.
[Operator overloading] relies on a language’s ability to redefine the meaning of functions and operators. All primitives are overloaded so that they additionally perform a tracing operation: The primitive is logged onto a ‘tape’, along with its inputs to ensure that those intermediate variables are kept alive. At the end of the function’s execution, this tape contains a linear trace of all the numerical operations in the program. Derivatives can be calculated by walking this tape in reverse. [...]
OO is the technique used by PyTorch, Autograd, and Chainer [37].
...
Tape-based Frameworks such as ADIFOR [8] and Tapenade [20] for Fortran and C use a global stack also called a ‘tape’ [2] to ensure that intermediate variables are kept alive. The original (primal) function is augmented so that it writes intermediate variables to the tape during the forward pass, and the adjoint program will read intermediate variables from the tape during the backward pass. More recently, tape-based ST was implemented for Python in the ML framework Tangent [38].
...
[2] The tape used in ST stores only the intermediate variables, whereas the tape in OO is a program trace that stores the executed primitives as well.
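To make the operator-overloading sense of 'tape' concrete, here is a toy sketch (my own illustration, not PyTorch's actual implementation) of primitives being logged onto a tape during the forward pass and the tape being walked in reverse to accumulate gradients:
# Toy operator-overloading AD: primitives are logged onto a global 'tape'
# during the forward pass, then the tape is walked in reverse.
class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0
    def __add__(self, other):
        out = Var(self.value + other.value)
        tape.append(('add', self, other, out))   # log primitive and its inputs
        return out
    def __mul__(self, other):
        out = Var(self.value * other.value)
        tape.append(('mul', self, other, out))
        return out

tape = []                                        # the program trace ('tape')
x = Var(2.0)
y = x * x + x                                    # forward pass: ops get recorded

y.grad = 1.0
for op, a, b, out in reversed(tape):             # backward pass: walk the tape in reverse
    if op == 'add':
        a.grad += out.grad
        b.grad += out.grad
    elif op == 'mul':
        a.grad += b.value * out.grad
        b.grad += a.value * out.grad

print(x.grad)                                    # d(x*x + x)/dx at x = 2 -> 5.0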
There are different types of automatic differentiation, e.g. forward-mode, reverse-mode, hybrids; (more explanation). The tape-based autograd in PyTorch simply refers to its use of reverse-mode automatic differentiation (source). Reverse-mode auto diff is simply a technique used to compute gradients efficiently, and it happens to be the one used by backpropagation.
Now, in PyTorch, Autograd is the core torch package for automatic differentiation. It uses a tape-based system for automatic differentiation: in the forward phase, the autograd tape remembers all the operations it executed, and in the backward phase it replays those operations.
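A minimal PyTorch example of this record-then-replay behaviour (the function and values are just illustrative):
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x     # forward: each operation is recorded as a graph node
print(y.grad_fn)       # the last recorded op, e.g. <AddBackward0 ...>
y.backward()           # backward: the recorded operations are replayed in reverse
print(x.grad)          # tensor(7.) == 2*x + 3 at x = 2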
It is the same in TensorFlow: to differentiate automatically, it also needs to remember what operations happen, and in what order, during the forward pass. Then, during the backward pass, TensorFlow traverses this list of operations in reverse order to compute gradients. TensorFlow provides the tf.GradientTape API for automatic differentiation, i.e. computing the gradient of a computation with respect to some inputs, usually tf.Variables. TensorFlow records the relevant operations executed inside the context of a tf.GradientTape onto a tape, and then uses that tape to compute the gradients of the recorded computation using reverse-mode differentiation.
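The same toy computation with tf.GradientTape (again, illustrative values only):
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:   # ops inside this scope are recorded onto the tape
    y = x ** 2 + 3 * x            # forward pass
dy_dx = tape.gradient(y, x)       # the tape is traversed in reverse to get dy/dx
print(dy_dx.numpy())              # 7.0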
So, from a high-level viewpoint, both are doing the same thing. However, in a custom training loop, the forward pass and the loss computation are more explicit in TensorFlow, because they happen inside the tf.GradientTape API scope, whereas in PyTorch these operations are implicit. PyTorch does, however, need gradient recording to be disabled temporarily while updating the training parameters (weights and biases), as if their requires_grad flag were set to False; for this it explicitly uses the torch.no_grad API. In other words, TensorFlow's tf.GradientTape() is roughly analogous to PyTorch's loss.backward(). Below are the above statements in simple code form.
# TensorFlow
[w, b] = tf_model.trainable_variables
for epoch in range(epochs):
    with tf.GradientTape() as tape:
        # forward passing and loss calculations
        # within explicit tape scope
        predictions = tf_model(x)
        loss = squared_error(predictions, y)
    # compute gradients (grad)
    w_grad, b_grad = tape.gradient(loss, tf_model.trainable_variables)
    # update training variables
    w.assign(w - w_grad * learning_rate)
    b.assign(b - b_grad * learning_rate)
# PyTorch
[w, b] = torch_model.parameters()
for epoch in range(epochs):
    # forward pass and loss calculation
    # implicit tape-based AD
    y_pred = torch_model(inputs)
    loss = squared_error(y_pred, labels)
    # compute gradients (grad)
    loss.backward()
    # update training variables / parameters
    with torch.no_grad():
        w -= w.grad * learning_rate
        b -= b.grad * learning_rate
        w.grad.zero_()
        b.grad.zero_()
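The two snippets above rely on names defined elsewhere (the models, squared_error, the data, epochs and learning_rate). Purely as an assumption, a minimal set of hypothetical definitions that would make the PyTorch snippet runnable could look like this:
import torch

# hypothetical stand-ins, not part of the original snippets
torch_model = torch.nn.Linear(1, 1)          # exactly two parameters: weight and bias

def squared_error(y_pred, y_true):
    return ((y_pred - y_true) ** 2).mean()

inputs = torch.randn(64, 1)
labels = 3.0 * inputs + 2.0                  # synthetic targets for a toy regression
epochs, learning_rate = 100, 0.1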
FYI, in the above, the trainable variables (w, b) are updated manually in both frameworks, but we would usually use an optimizer (e.g. adam) to do that job.
# TensorFlow
# ....
# update training variables
optimizer.apply_gradients(zip([w_grad, b_grad], tf_model.trainable_variables))
# PyTorch
# ....
# update training variables / parameters
optimizer.step()
optimizer.zero_grad()
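For completeness, a sketch of what the PyTorch loop looks like with an optimizer (reusing the hypothetical torch_model / squared_error names from above):
import torch

optimizer = torch.optim.Adam(torch_model.parameters(), lr=learning_rate)
for epoch in range(epochs):
    y_pred = torch_model(inputs)
    loss = squared_error(y_pred, labels)
    loss.backward()            # gradients computed by replaying the recorded ops
    optimizer.step()           # Adam update for all parameters
    optimizer.zero_grad()      # clear gradients before the next iteration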