Conceptual understanding of GradientTape.gradient

Background

In TensorFlow 2 there is a class called GradientTape which is used to record operations on tensors, the results of which can then be differentiated and fed to some minimization algorithm. For example, from the documentation we have this example:

import tensorflow as tf

x = tf.constant(3.0)
with tf.GradientTape() as g:
  g.watch(x)
  y = x * x
dy_dx = g.gradient(y, x) # Will compute to 6.0
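
To illustrate what "fed to some minimization algorithm" might look like in practice, here is a minimal sketch of my own (the use of a tf.Variable and tf.keras.optimizers.SGD is an assumption for illustration, not part of the documentation example):

import tensorflow as tf

x = tf.Variable(3.0)                              # trainable variable instead of a constant
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    y = x * x                                     # function to minimize
dy_dx = tape.gradient(y, x)                       # dy/dx = 2x = 6.0
opt.apply_gradients([(dy_dx, x)])                 # one descent step: x <- 3.0 - 0.1 * 6.0
print(x.numpy())                                  # 2.4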

The docstring of the gradient method implies that the first argument can be not only a tensor, but also a list of tensors:

 def gradient(self,
               target,
               sources,
               output_gradients=None,
               unconnected_gradients=UnconnectedGradients.NONE):
    """Computes the gradient using operations recorded in context of this tape.

    Args:
      target: a list or nested structure of Tensors or Variables to be
        differentiated.
      sources: a list or nested structure of Tensors or Variables. `target`
        will be differentiated against elements in `sources`.
      output_gradients: a list of gradients, one for each element of
        target. Defaults to None.
      unconnected_gradients: a value which can either hold 'none' or 'zero' and
        alters the value which will be returned if the target and sources are
        unconnected. The possible values and effects are detailed in
        'UnconnectedGradients' and it defaults to 'none'.

    Returns:
      a list or nested structure of Tensors (or IndexedSlices, or None),
      one for each element in `sources`. Returned structure is the same as
      the structure of `sources`.

    Raises:
      RuntimeError: if called inside the context of the tape, or if called more
       than once on a non-persistent tape.
      ValueError: if the target is a variable or if unconnected gradients is
       called with an unknown value.
    """

In the example above it is easy to see that y, the target, is the function to be differentiated, and x is the variable the "gradient" is taken with respect to.

From my limited experience, the gradient method seems to return a list of tensors, one per element of sources, and each of these gradients is a tensor with the same shape as the corresponding member of sources.
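
That observation is easy to check directly; a small sketch of mine, with sources of different shapes:

import tensorflow as tf

x1 = tf.Variable(tf.ones((2, 3)))
x2 = tf.Variable(tf.ones((4,)))
with tf.GradientTape() as tape:
    y = tf.reduce_sum(x1) + tf.reduce_sum(x2 ** 2)
g1, g2 = tape.gradient(y, [x1, x2])
print(g1.shape, g2.shape)  # (2, 3) (4,) -- each gradient has the shape of its source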

Question

The above description of the behavior of gradient makes sense if target contains a single 1x1 "tensor" to be differentiated, because mathematically a gradient vector should live in the same space as the domain of the function.

However, if target is a list of tensors, the output of gradient still has the same shape. Why is that? If target is thought of as a list of functions, shouldn't the output resemble a Jacobian? How should I interpret this behavior conceptually?
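
For reference, TF 2 does expose a Jacobian-style counterpart, GradientTape.jacobian, which keeps the per-target derivatives separate instead of combining them; a minimal sketch of mine:

import tensorflow as tf

x = tf.Variable([1.0, 2.0])
with tf.GradientTape() as tape:
    y = x * x                  # two outputs: [x1^2, x2^2]
jac = tape.jacobian(y, x)      # full Jacobian dy_i/dx_j, shape (2, 2)
print(jac.numpy())             # [[2. 0.]
                               #  [0. 4.]]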

That is simply how tf.GradientTape().gradient() is defined. It has the same functionality as tf.gradients(), except that the latter cannot be used in eager mode. From the docs of tf.gradients():

It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys

where xs are sources and ys are target.

Example 1:

So let's say target = [y1, y2] and sources = [x1, x2]. The result will be:

[dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]
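
A quick numerical check of this (a sketch of mine, with arbitrary functions y1 and y2):

import tensorflow as tf

x1 = tf.Variable(2.0)
x2 = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y1 = x1 * x2   # dy1/dx1 = x2 = 3, dy1/dx2 = x1 = 2
    y2 = x1 + x2   # dy2/dx1 = 1,      dy2/dx2 = 1
grads = tape.gradient([y1, y2], [x1, x2])
print([g.numpy() for g in grads])  # [4.0, 3.0] == [dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]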

Example 2:

Computing the gradients of a per-sample loss (a tensor) versus a reduced loss (a scalar):

Let w, b be two variables. 
xentropy = [y1, y2] # tensor
reduced_xentropy = 0.5 * (y1 + y2) # scalar
grads = [dy1/dw + dy2/dw, dy1/db + dy2/db]
reduced_grads = [d(reduced_xentropy)/dw, d(reduced_xentropy)/db]
              = [d(0.5 * (y1 + y2))/dw, d(0.5 * (y1 + y2))/db] 
              == 0.5 * grads

A TensorFlow example of the above snippet:

import tensorflow as tf

print(tf.__version__) # 2.1.0

inputs = tf.convert_to_tensor([[0.1, 0], [0.5, 0.51]]) # two two-dimensional samples
w = tf.Variable(initial_value=inputs)
b = tf.Variable(tf.zeros((2,)))
labels = tf.convert_to_tensor([0, 1])

def forward(inputs, labels, var_list):
    w, b = var_list
    logits = tf.matmul(inputs, w) + b
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return xentropy

# `xentropy` has one element per sample in the batch; taking the gradient of
# this (non-scalar) tensor sums the per-sample gradients for each variable
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads = g.gradient(xentropy, [w, b])
print(xentropy.numpy()) # [0.6881597  0.71584916]
print(grads[0].numpy()) # [[ 0.20586157 -0.20586154]
                        #  [ 0.2607238  -0.26072377]]

# `reduced_xentropy` is a scalar; its gradients are the mean of the
# per-sample gradients (here 0.5 * the summed gradients above)
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads_reduced = g.gradient(reduced_xentropy, [w, b])
print(reduced_xentropy.numpy()) # 0.70200443 <-- scalar
print(grads_reduced[0].numpy()) # [[ 0.10293078 -0.10293077]
                                #  [ 0.1303619  -0.13036188]]

If you compute the loss (xentropy) for each element of a batch, the final gradient with respect to each variable is the sum of the per-sample gradients over the batch (which makes sense).
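
As a quick sanity check (reusing grads and grads_reduced from the snippet above, so this is a continuation rather than a standalone script), the gradient of the mean over the two samples is indeed half the summed gradient:

import numpy as np

print(np.allclose(grads_reduced[0].numpy(), 0.5 * grads[0].numpy()))  # True (w)
print(np.allclose(grads_reduced[1].numpy(), 0.5 * grads[1].numpy()))  # True (b)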