How to get the second moment of the gradient

Question

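The OpenAI Five paper mentions that "each parameter's gradient is additionally clipped to between ±5√v, where v is the running estimate of the second moment of the (unclipped) gradient." This is something I want to implement in my project, but I am not sure how to do it, either in theory or in practice.

From Wikipedia I found out that "The second central moment is the variance. The positive square root of the variance is the standard deviation [...]". My best guess regarding the "running estimate" is that it is the exponential moving average. The gradients of a network can be accessed as this comment suggests.

From these I assume that √v is the exponential running average of the standard deviation of the gradients, which could be computed as: estimate = alpha * torch.std(list(param.grad for param in model.parameters())) + (1-alpha) * estimate

Is my theory correct? Is there a better way to do it? Thanks in advance.

Edit: Fixed the gradient collection following Mr. For Example's answer.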

Answer 1

I think you are on the right track. My guess is basically the same as yours, with slight differences.

First, what is a moment?

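The n-th moment of a random variable is defined as the expected value of that variable raised to the n-th power. More formally:

m_n = E[X^n]

where m is the moment and X is the random variable.

So the first moment is the mean, and the second moment is the uncentered variance (meaning we do not subtract the mean when computing it). Intuitively, it makes sense to clip the gradients by a moving average of their standard deviation taken about zero, which is the square root of the second moment.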
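
To make the difference between the second moment and the variance concrete, here is a small sketch in plain Python. The sample values and the variable names are made up for illustration; the only thing it relies on is the identity E[X^2] = Var(X) + E[X]^2.

```python
import random
import statistics

random.seed(0)
# Hypothetical sample standing in for gradient values with a non-zero mean.
xs = [random.gauss(0.5, 2.0) for _ in range(10_000)]

first_moment = statistics.fmean(xs)                    # E[X]
second_moment = statistics.fmean([x * x for x in xs])  # E[X^2], mean NOT subtracted
variance = statistics.pvariance(xs, mu=first_moment)   # E[(X - E[X])^2]

# The two printed numbers agree up to floating-point error.
print(second_moment, variance + first_moment ** 2)
```

When the mean of the gradients is close to zero the two quantities nearly coincide, which is why the standard deviation is a reasonable first guess for √v.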

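Second, what is the correct code?

list(network.parameters()) only gives you the parameters. To get the gradient of each parameter you need [param.grad for param in network.parameters()].

Given everything above, the correct code should be (you can try to optimize it in various ways):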

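import torch
# Parameters have different shapes, so flatten and concatenate the squared gradients instead of stacking them.
grads_square = torch.cat([torch.square(param.grad).flatten() for param in network.parameters()])
estimate = alpha * torch.sqrt(torch.mean(grads_square)) + (1-alpha) * estimate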
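
The code above keeps one scalar estimate for the whole network. Another possible reading of "each parameter's gradient" is that v is tracked per element, much like the second-moment buffer in Adam. The following is a minimal sketch under that assumption: the helper name update_and_clip_, the decay beta=0.99, initialising v with the first squared gradient, and updating v before clipping are all illustrative choices, not something the paper or this answer confirms.

```python
import torch


@torch.no_grad()
def update_and_clip_(params, second_moments, beta=0.99, k=5.0):
    """Update a per-element running second moment from the unclipped
    gradients, then clip each gradient in place to [-k*sqrt(v), k*sqrt(v)].

    second_moments is a dict keyed by parameter (hypothetical helper state).
    """
    for p in params:
        if p.grad is None:
            continue
        g = p.grad
        g_sq = torch.square(g)
        v = second_moments.get(p)
        if v is None:
            v = g_sq  # assumption: start the estimate at the first observation
        else:
            v.mul_(beta).add_(g_sq, alpha=1 - beta)  # EMA of squared gradients
        second_moments[p] = v
        bound = k * torch.sqrt(v)
        g.copy_(torch.maximum(torch.minimum(g, bound), -bound))


# Usage with a toy model (all names here are for illustration only).
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
state = {}

loss = model(torch.randn(8, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
update_and_clip_(model.parameters(), state)
optimizer.step()
```

Because v is updated from the gradient before it is clipped, the estimate follows the unclipped gradients, as the quoted sentence requires.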

machine-learning

reinforcement-learning

pytorch