Conceptual understanding of GradientTape.gradient
Background
In TensorFlow 2 there is a class called GradientTape which is used to record operations on tensors, the results of which can then be differentiated and handed to some minimization algorithm. For example, from the documentation we have this example:
x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)
    y = x * x
dy_dx = g.gradient(y, x)  # Will compute to 6.0
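For concreteness, here is a minimal sketch of the "feed it to some minimization algorithm" step (my own illustration, not part of the documentation example; the Keras SGD optimizer and the variable v are assumptions):
import tensorflow as tf

v = tf.Variable(3.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
with tf.GradientTape() as tape:
    loss = v * v                    # same toy function as above
grad = tape.gradient(loss, v)       # 2 * v = 6.0
opt.apply_gradients([(grad, v)])    # v <- 3.0 - 0.1 * 6.0 = 2.4
print(v.numpy())                    # 2.4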
The docstring of the gradient method implies that the first argument can be not just a single tensor, but also a list of tensors:
def gradient(self,
             target,
             sources,
             output_gradients=None,
             unconnected_gradients=UnconnectedGradients.NONE):
  """Computes the gradient using operations recorded in context of this tape.

  Args:
    target: a list or nested structure of Tensors or Variables to be
      differentiated.
    sources: a list or nested structure of Tensors or Variables. `target`
      will be differentiated against elements in `sources`.
    output_gradients: a list of gradients, one for each element of
      target. Defaults to None.
    unconnected_gradients: a value which can either hold 'none' or 'zero' and
      alters the value which will be returned if the target and sources are
      unconnected. The possible values and effects are detailed in
      'UnconnectedGradients' and it defaults to 'none'.

  Returns:
    a list or nested structure of Tensors (or IndexedSlices, or None),
    one for each element in `sources`. Returned structure is the same as
    the structure of `sources`.

  Raises:
    RuntimeError: if called inside the context of the tape, or if called more
      than once on a non-persistent tape.
    ValueError: if the target is a variable or if unconnected gradients is
      called with an unknown value.
  """
In the example above it is easy to see that y, the target, is the function to be differentiated, and x is the variable that the "gradient" is taken with respect to.
From my limited experience, the gradient method seems to return a list of tensors, one per element of sources, and each of these gradients is a tensor of the same shape as the corresponding member of sources.
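A quick sketch of that observation (the shapes and variable names here are my own, chosen only for illustration):
import tensorflow as tf

x = tf.Variable(tf.ones((3,)))      # shape (3,)
W = tf.Variable(tf.ones((2, 3)))    # shape (2, 3)
with tf.GradientTape() as tape:
    y = tf.reduce_sum(W @ tf.reshape(x, (3, 1)))   # a scalar target
gx, gW = tape.gradient(y, [x, W])
print(gx.shape, gW.shape)           # (3,) (2, 3) -- same shapes as the sources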
Question
The description of the behavior of gradient above makes sense if target contains a single 1x1 "tensor" to be differentiated, because mathematically a gradient vector should have the same dimensions as the domain of the function.
However, if target is a list of tensors, the output of gradient still has the same shape. Why is this the case? If target is thought of as a list of functions, shouldn't the output resemble something like a Jacobian? How am I to interpret this behavior conceptually?
That is just how tf.GradientTape().gradient() is defined. It has the same functionality as tf.gradients(), except that the latter cannot be used in eager mode. From the docs of tf.gradients():
It returns a list of Tensor of length len(xs)
where each tensor is the sum(dy/dx) for y in ys
where xs are the sources and ys are the target.
Example 1:
So let's say target = [y1, y2] and sources = [x1, x2]. The result will be:
[dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]
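A minimal sketch verifying this summation (the functions y1 = x1 * x2 and y2 = x1 + x2 are made up purely for illustration):
import tensorflow as tf

x1 = tf.Variable(2.0)
x2 = tf.Variable(3.0)
with tf.GradientTape() as g:
    y1 = x1 * x2    # dy1/dx1 = x2 = 3, dy1/dx2 = x1 = 2
    y2 = x1 + x2    # dy2/dx1 = 1,      dy2/dx2 = 1
grads = g.gradient([y1, y2], [x1, x2])
print([t.numpy() for t in grads])   # [4.0, 3.0] = [dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]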
Example 2:
Gradients of the per-sample losses (a tensor) vs. the reduced loss (a scalar)
Let w, b be two variables.
xentropy = [y1, y2] # tensor
reduced_xentropy = 0.5 * (y1 + y2) # scalar
grads = [dy1/dw + dy2/dw, dy1/db + dy2/db]
reduced_grads = [d(reduced_xentropy)/dw, d(reduced_xentropy)/db]
              = [d(0.5 * (y1 + y2))/dw, d(0.5 * (y1 + y2))/db]
              == 0.5 * grads
TensorFlow example of the snippet above:
import tensorflow as tf

print(tf.__version__)  # 2.1.0

inputs = tf.convert_to_tensor([[0.1, 0], [0.5, 0.51]])  # two two-dimensional samples
w = tf.Variable(initial_value=inputs)
b = tf.Variable(tf.zeros((2,)))
labels = tf.convert_to_tensor([0, 1])

def forward(inputs, labels, var_list):
    w, b = var_list
    logits = tf.matmul(inputs, w) + b
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)
    return xentropy

# `xentropy` has two elements (gradient of a tensor -- the gradient
# of each sample in the batch)
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads = g.gradient(xentropy, [w, b])
print(xentropy.numpy())  # [0.6881597  0.71584916]
print(grads[0].numpy())  # [[ 0.20586157 -0.20586154]
                         #  [ 0.2607238  -0.26072377]]

# `reduced_xentropy` is a scalar (gradient of a scalar)
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])
    reduced_xentropy = tf.reduce_mean(xentropy)
grads_reduced = g.gradient(reduced_xentropy, [w, b])
print(reduced_xentropy.numpy())  # 0.70200443 <-- scalar
print(grads_reduced[0].numpy())  # [[ 0.10293078 -0.10293077]
                                 #  [ 0.1303619  -0.13036188]]
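As a sanity check (my addition, assuming both snippets above were run in the same session): the reduced-loss gradients are exactly half of the summed per-sample gradients, because reduce_mean over two samples divides the sum by 2.
for g_sum, g_mean in zip(grads, grads_reduced):
    print(tf.reduce_max(tf.abs(g_sum - 2.0 * g_mean)).numpy())  # ~0.0 for both w and b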
If you compute the loss (xentropy) of every element in the batch, the final gradient with respect to each variable is the sum of all the per-sample gradients in the batch (which makes sense).
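If you actually want the Jacobian-like output the question asks about (one gradient per element of target), tf.GradientTape also provides a jacobian method. A minimal sketch reusing forward and the variables defined above:
with tf.GradientTape() as g:
    xentropy = forward(inputs, labels, [w, b])   # shape (2,): one loss per sample
jac_w = g.jacobian(xentropy, w)
print(jac_w.shape)                           # (2, 2, 2): target shape + source shape
print(tf.reduce_sum(jac_w, axis=0).numpy())  # summing over samples recovers grads[0]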