Karpathy Pong cross-entropy/log loss explanation for y - aprob

Question

I'm trying to understand Karpathy's Pong code in Python, explained here: karpathy pong

# forward the policy network and sample an action from the returned probability
  #########action 2 is up and 3 is down
  aprob, h = policy_forward(x)
  print("aprob\n {}\n h\n {}\n".format(aprob, h))
  #2 is up, 3 is down
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice!
  print("action\n {}\n".format(action))
  # record various intermediates (needed later for backprop)
  xs.append(x) # observation, ie. the difference frame?
  #print("xs {}".format(xs))
  hs.append(h) # hidden state obtained from forward pass
  #print("hs {}".format(hs)) 
  #if action is up, y = 1, else 0
  y = 1 if action == 2 else 0 # a "fake label"
  print("y \n{}\n".format(y))
  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)
  print("dlogps\n {}\n".format(dlogps))
  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  print("observation\n {}\n reward\n {}\n done\n {}\n ".format(observation, reward, done))
  reward_sum += reward
  print("reward_sum\n {}\n".format(reward_sum))
  drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)
  print("drs\n {}\n".format(drs))
  if done: # an episode finished
    episode_number += 1

In the snippet above, I don't quite understand why the fake label is needed or what this line means:
dlogps.append(y - aprob)# grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)

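Why is it the fake label y minus aprob?

My understanding is that the network outputs a "log probability" of moving up, but the explanation seems to say that the label should really be the reward obtained for taking that action, and that all actions in an episode are then encouraged if the episode was won. So I don't see how a fake label of 1 or 0 helps.

Also, in the forward pass function, how can there be a log probability without any log operation?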

#forward pass, how is logp a logp without any log operation?????
def policy_forward(x):
  h = np.dot(model['W1'], x)
  h[h<0] = 0 # ReLU nonlinearity
  logp = np.dot(model['W2'], h)
  p = sigmoid(logp)
  #print("p\n {}\n and h\n {}\n".format(p, h))
  return p, h # return probability of taking action 2 (up), and hidden state

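Edit:

I used print statements to see what happens under the hood and found that, since y = 0 for the down action, (y - aprob) is negative for the down action. His formula for modulating the gradient with the advantage, epdlogp *= discounted_epr, still ends up indicating whether moving down was good (a negative number) or bad (a positive number).
For the up action, the opposite holds once the formula is applied: a positive value after epdlogp *= discounted_epr indicates a good action and a negative value a bad one.
So this seems to be a rather neat way to implement it, but I still don't understand how the aprob returned from the forward pass is a log probability, since the console output looks like this: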

aprob
 0.5

action
 3

aprob
 0.5010495775824385

action
 2

aprob
 0.5023498477623756

action
 2

aprob
 0.5051575154468827

action
 2

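Those look like probabilities between 0 and 1. So is using y - aprob as a "log probability" just a hack that comes from intuition built up over months and years of practice? If so, are such hacks discovered by trial and error?

Edit: Thanks to Tommy's great explanation, I knew to look in my Udacity deep learning course videos for a refresher on log probabilities and cross-entropy: https://www.youtube.com/watch?time_continue=94&v=iREoPUrpXvE

Also, this cheat sheet helped: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html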

Answer 1

My explanation of how he arrives at (y-aprob):

When he does a forward pass through his network, the last step is to apply the sigmoid S(x) to the output of the last neuron.

S(x) = 1 / (1+e^-x)

and its gradient is

grad S(x) = S(x)(1-S(x))
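For readers who want the intermediate step, the sigmoid gradient can be worked out as follows. This is the standard calculus derivation, written out here as a sketch rather than quoted from the original code:

```latex
\begin{aligned}
S(x) &= \frac{1}{1+e^{-x}}, \qquad 1 - S(x) = \frac{e^{-x}}{1+e^{-x}} \\
\frac{dS}{dx} &= \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
  = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}
  = S(x)\,\bigl(1 - S(x)\bigr)
\end{aligned}
```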

To increase or decrease the likelihood of your actions, you have to compute the log of the probabilities of your 'labels':

L = log p(y|x)

To backpropagate this, you have to compute the gradient of the log-likelihood L:

grad L = grad log p(y|x)

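Since the sigmoid is applied to the output, p = S(z), where z is the value before the sigmoid (logp in the code). For the case y = 1 (the action taken was up), what you actually compute is the gradient with respect to z:

grad L = grad log S(z)  
grad L = 1 / S(z) * S(z)(1-S(z))  
grad L = (1-S(z))  
**grad L = (1-p)**

For y = 0 the same steps applied to log(1 - S(z)) give grad L = -p.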

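This is in fact the gradient of the Log Loss / Cross Entropy, up to sign. Written for both labels at once, the log-likelihood is:

L = y log p + (1-y) log(1-p)  
grad L = y-p with y either 0 or 1

The log loss is -L, so minimizing it is the same as maximizing L, and the gradient of L with respect to the pre-sigmoid output is exactly the y - aprob that the code stores.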
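The y - p result can be checked numerically. The sketch below is not part of Karpathy's script. It writes z for the pre-sigmoid value (the one named logp in the question's policy_forward), and the helper names log_likelihood, numeric_grad and check are made up for illustration. It compares a finite-difference derivative of y log p + (1-y) log(1-p) with respect to z against y - p:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(z, y):
    """log p(y | x) for a Bernoulli policy with p(up) = sigmoid(z)."""
    p = sigmoid(z)
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def numeric_grad(z, y, eps=1e-6):
    """Central finite difference of log_likelihood with respect to z."""
    return (log_likelihood(z + eps, y) - log_likelihood(z - eps, y)) / (2 * eps)

def check(zs=(-2.0, 0.0, 0.7, 3.0)):
    """Return the largest gap between the numeric gradient and y - p."""
    worst = 0.0
    for z in zs:
        p = sigmoid(z)
        for y in (0, 1):
            worst = max(worst, abs(numeric_grad(z, y) - (y - p)))
    return worst

if __name__ == "__main__":
    print("max |numeric - (y - p)| =", check())
```

This also bears on the "no log operation" part of the question. The value passed into the sigmoid is the log-odds, log(p / (1 - p)), and the sigmoid turns it into the probability that gets printed as aprob. So aprob itself is a plain probability, and y - aprob is the gradient of the log probability with respect to that pre-sigmoid value.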

Since Andrej does not use a framework such as TensorFlow or PyTorch in his example, he does this bit of backpropagation by hand right there.
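To see how the stored y - aprob and the advantage combine, here is a minimal sketch. It is an illustration and does not reproduce the original script. It treats the pre-sigmoid value z as if it were the only trainable parameter, whereas the real code pushes the same signal back into model['W1'] and model['W2']. The function name updated_aprob and the step size lr are made up for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def updated_aprob(z, action_up, advantage, lr=0.1):
    """Take one gradient-ascent step on z and return the new P(up).

    z         -- pre-sigmoid output (assumed directly trainable here)
    action_up -- True if the sampled action was up (2), False if down (3)
    advantage -- positive if the action turned out well, negative if not
    """
    aprob = sigmoid(z)
    y = 1 if action_up else 0             # the "fake label": the action actually sampled
    dlogp = y - aprob                     # d log p(sampled action) / dz
    z_new = z + lr * advantage * dlogp    # the advantage sets the direction and size
    return sigmoid(z_new)

if __name__ == "__main__":
    for action_up in (True, False):
        for advantage in (1.0, -1.0):
            name = "up" if action_up else "down"
            print(name, advantage, updated_aprob(0.0, action_up, advantage))
```

Starting from aprob = 0.5, a rewarded up action raises P(up) and a rewarded down action lowers it, so each makes the sampled action more likely. A negative advantage flips both directions. The fake label only records which action was sampled. Whether that action is encouraged or discouraged comes from the reward, through the advantage.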

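I was rather confused by it at first too, and it took me some time to figure out what magic was happening there. Perhaps he could have been a little clearer there and given some hints.

At least that is my humble understanding of his code :)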

python

gradient

reinforcement-learning