Why is the input scaled in tf.nn.dropout in TensorFlow?

Question

I can't understand why dropout works like this in TensorFlow. The CS231n blog says that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." You can also see this in the picture (taken from the same site).

From the TensorFlow site: "With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0."

Now, why is the input element scaled up by 1/keep_prob? Why not keep the input element as it is with probability keep_prob, without scaling it by 1/keep_prob?

Answer 1

This scaling lets the same network be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:

The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.

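Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op at training time that scales up the retained values by 1. / keep_prob, which has the same effect as scaling up the outgoing weights. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).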
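The idea can be sketched in plain NumPy. This is only an illustration of the scaling rule quoted from the TensorFlow documentation, not TensorFlow's actual implementation; the function name `inverted_dropout` and the use of `np.random.default_rng` are assumptions made for this example.

```python
import numpy as np

def inverted_dropout(x, keep_prob, rng):
    """Keep each element with probability keep_prob and scale survivors by 1 / keep_prob.

    Hypothetical helper for illustration; it mimics the documented behaviour
    of tf.nn.dropout but is not the library's code.
    """
    if keep_prob == 1.0:
        return x  # evaluation: the op is the identity
    mask = rng.random(x.shape) < keep_prob      # True where the element is kept
    return np.where(mask, x / keep_prob, 0.0)   # kept values are scaled up, the rest are zeroed

rng = np.random.default_rng(0)
x = np.ones(100_000)

train_out = inverted_dropout(x, keep_prob=0.5, rng=rng)  # training setting
eval_out = inverted_dropout(x, keep_prob=1.0, rng=rng)   # evaluation setting

print("train mean:", train_out.mean())  # close to the mean of x
print("eval mean: ", eval_out.mean())   # equal to the mean of x
```

Because the same function serves both settings, only the value of `keep_prob` changes between training and evaluation.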

Answer 2

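Let's say the network has n neurons and we apply a dropout rate of 1/2.

During the training phase, we are left with n/2 neurons. So if you were expecting an output x with all the neurons, you now get x/2 on average. So for every batch, the network weights are trained according to this x/2.

During the testing/inference/validation phase, we don't apply any dropout, so the output is x. In this case the output would be x rather than x/2, which would give incorrect results. So what you can do is scale it down to x/2 during testing.

Rather than applying this scaling only in the testing phase, TensorFlow's dropout layer scales the output during training, so that the expected sum is the same with or without dropout (training or testing).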
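The x versus x/2 argument can be checked numerically. The sketch below assumes, purely for illustration, that every neuron outputs 1, so the full sum x equals n; the variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
keep_prob = 0.5               # dropout rate of 1/2
outputs = np.ones(n)          # assumed: each neuron outputs 1, so x == n

mask = rng.random(n) < keep_prob      # which neurons survive this batch

x_full = outputs.sum()                # no dropout (test time)
x_dropped = (outputs * mask).sum()    # dropout without scaling, roughly x / 2
x_rescaled = x_dropped / keep_prob    # dropout with 1 / keep_prob scaling, roughly x

print("full sum:     ", x_full)
print("dropped sum:  ", x_dropped)
print("rescaled sum: ", x_rescaled)
```

The rescaled sum matches the full sum only on average; any single mask gives a value that fluctuates around it.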

Answer 3

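If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.

In a network with no dropout, the activations in layer L will be aL. The weights of the next layer (L+1) will be learned so that it receives aL and produces output accordingly. But in a network with dropout (with keep_prob = p), the weights of L+1 will be learned so that it receives p*aL and produces output accordingly. Why p*aL? Because the expected value of the dropped-out activation is probability_of_keeping(aL)*aL + probability_of_not_keeping(aL)*0, which equals p*aL + (1-p)*0 = p*aL. In the same network, there is no dropout at test time, so layer L+1 simply receives aL. But its weights were trained to expect p*aL as input. Therefore, at test time you would have to multiply the activations by p. Instead of doing that, you can multiply the activations by 1/p during training only. This is called inverted dropout.

Since we want to leave the forward pass at test time untouched (and tweak our network only during training), tf.nn.dropout directly implements inverted dropout, scaling the values.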
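The expectation argument can be written out compactly. The mask variable m is notation introduced here for illustration (it is not used in the answer itself); a_L and p are as defined above.

```latex
\begin{aligned}
m &\sim \mathrm{Bernoulli}(p) \\
\text{standard dropout:}\quad
  \mathbb{E}[\,m\,a_L\,] &= p\,a_L + (1-p)\cdot 0 = p\,a_L \\
\text{inverted dropout:}\quad
  \mathbb{E}\!\left[\frac{m}{p}\,a_L\right] &= \frac{1}{p}\,\bigl(p\,a_L + (1-p)\cdot 0\bigr) = a_L
\end{aligned}
```

With inverted dropout the expected input to layer L+1 during training already equals a_L, which is what the layer receives at test time, so no test-time correction is needed.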

Answer 4

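Here is a quick experiment that should clear up any remaining confusion.

Statistically, the weights of a NN layer follow a distribution that is usually close to normal (though not necessarily). But even when you try to sample from a perfect normal distribution, in practice there is always some sampling error.

Then consider the following experiment: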

import numpy as np
from collections import defaultdict

DIM = 1_000_000                      # set our dims for weights and input
x = np.ones((DIM,1))                 # our input vector
#x = np.random.rand(DIM,1)*2-1.0     # or could also be a more realistic normalized input
print("x-mean = ", x.mean())

probs = [1.0, 0.7, 0.5, 0.3]         # keep probabilities (printed as "drop-out prob" below)

W = np.random.normal(size=(DIM,1))   # sample normally distributed weights
print("W-mean = ", W.mean())         # note the mean is not perfect --> sampling error!

# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
  for p in probs:
    M = np.random.rand(DIM,1)
    M = (M < p).astype(int)
    Wp = W * M
    a = np.dot(Wp.T, x)
    h[str(p)].append(a)

for k,v in h.items():
  print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))

Sample output:

x-mean =  1.0
W-mean =  -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)

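Notice that the unscaled activations shrink as the keep probability drops, because the sampled weights do not follow a statistically perfect normal distribution and their mean is not exactly zero.

Can you spot the obvious correlation between the W-mean and the average linear activation means?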
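One way to make that correlation explicit: with x set to all ones, the unscaled activation is the sum of the kept weights, whose expected value is p * W.sum(), that is p * DIM * W-mean. The sketch below is a smaller, vectorized variant of the experiment, not the original code; the reduced DIM, the trial count, and the names `keep_probs`, `trials`, and `results` are assumptions chosen to keep memory use modest.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10_000                       # smaller than the original experiment
trials = 500                       # number of random masks per keep probability
keep_probs = [1.0, 0.7, 0.5, 0.3]

W = rng.normal(size=DIM)           # sampled weights; their mean is not exactly zero
print("W-mean = ", W.mean())

results = {}
for p in keep_probs:
    masks = rng.random((trials, DIM)) < p   # one keep-mask per trial
    acts = masks @ W                        # activation per trial, with x = ones
    observed = acts.mean()                  # average unscaled activation
    predicted = p * W.sum()                 # p * DIM * W-mean
    results[p] = (observed, predicted)
    print("keep prob %r: observed %r, predicted %r, observed scaled by 1/p %r"
          % (p, observed, predicted, observed / p))
```

Dividing the observed average by p brings every row back to roughly W.sum(), which is the value obtained without dropout.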

