MuZero 的伪代码中的奖励值是否未对齐？

Question

MuZero, a deep reinforcement learning technique, was just released, and I've been trying to implement it by looking at its pseudocode and this helpful tutorial 在 Medium 上。

但是，在伪代码训练期间如何处理奖励让我感到困惑，如果有人可以验证我是否正确阅读了代码，那将是很好的，如果我是，请解释为什么这个训练算法有效。

这是训练函数（来自pseudocode）：

def update_weights(optimizer: tf.train.Optimizer, network: Network, batch,
                   weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step, from the real observation.
    value, reward, policy_logits, hidden_state = network.initial_inference(
        image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps, from action and previous hidden state.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

      hidden_state = tf.scale_gradient(hidden_state, 0.5)

    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target

      l = (
          scalar_loss(value, target_value) +
          scalar_loss(reward, target_reward) +
          tf.nn.softmax_cross_entropy_with_logits(
              logits=policy_logits, labels=target_policy))

      loss += tf.scale_gradient(l, gradient_scale)

  for weights in network.get_weights():
    loss += weight_decay * tf.nn.l2_loss(weights)

  optimizer.minimize(loss)

我对损失中的 reward 特别感兴趣。请注意，损失从 predictions 获取其所有值。添加到 predictions 的第一个 reward 来自 network.initial_inference 函数。之后，predictions又增加了len(actions)个奖励，全部来自network.recurrent_inference函数

根据教程，initial_inference 和 recurrent_inference 函数由 3 个不同的函数构建而成：

预测输入：内部游戏状态。输出：政策，价值（未来最佳回报的预测总和）
Dynamics 输入：游戏的内部状态，动作。输出：采取该行动的奖励，游戏的新内部状态。
表示输入：游戏的外部状态。输出：游戏的内部状态

initial_inference 函数获取外部游戏状态，使用 representation 函数将其转换为内部状态，然后在该内部游戏状态上使用 prediction 函数.它输出内部状态、策略和值。

recurrent_inference 函数接收内部游戏状态和动作。它使用 dynamics 函数从旧游戏状态和动作中获取新的内部游戏状态和奖励。然后它将 prediction 函数应用于新的内部游戏状态，以获取该新内部状态的策略和值。因此，最终输出是一个新的内部状态、一个奖励、一个策略和一个值。

然而，在伪代码中，initial_inference函数也return是一个奖励。

我的主要问题：那个奖励代表什么？

在教程的 the tutorial, they just implicitly assume that the reward from the initial_inference function is 0. (See this image 中。）这是怎么回事？实际上没有奖励，所以 initial_inference 总是 return 奖励为 0？

让我们假设是这种情况。

在这个假设下，那么，predictions 列表中的第一个奖励将是 initial_inference 函数将 return 作为奖励的 0。然后，在损失中，这个 0 将与 target 列表的第一个元素进行比较。

target 的创建方式如下：

  def make_target(self, state_index: int, num_unroll_steps: int, td_steps: int,
                  to_play: Player):
    # The value target is the discounted root value of the search tree N steps
    # into the future, plus the discounted sum of all rewards until then.
    targets = []
    for current_index in range(state_index, state_index + num_unroll_steps + 1):
      bootstrap_index = current_index + td_steps
      if bootstrap_index < len(self.root_values):
        value = self.root_values[bootstrap_index] * self.discount**td_steps
      else:
        value = 0

      for i, reward in enumerate(self.rewards[current_index:bootstrap_index]):
        value += reward * self.discount**i  # pytype: disable=unsupported-operands

      if current_index < len(self.root_values):
        targets.append((value, self.rewards[current_index],
                        self.child_visits[current_index]))
      else:
        # States past the end of games are treated as absorbing states.
        targets.append((0, 0, []))
    return targets

这个函数编辑的targets return成为update_weights函数中的target列表。所以 targets 中的第一个值是 self.rewards[current_index]。 self.rewards 是玩游戏时收到的所有奖励的列表。唯一一次被编辑是在这个函数中 apply:

  def apply(self, action: Action):
    reward = self.environment.step(action)
    self.rewards.append(reward)
    self.history.append(action)

apply函数只在这里调用：

# Each game is produced by starting at the initial board position, then
# repeatedly executing a Monte Carlo Tree Search to generate moves until the end
# of the game is reached.
def play_game(config: MuZeroConfig, network: Network) -> Game:
  game = config.new_game()

  while not game.terminal() and len(game.history) < config.max_moves:
    # At the root of the search tree we use the representation function to
    # obtain a hidden state given the current observation.
    root = Node(0)
    current_observation = game.make_image(-1)
    expand_node(root, game.to_play(), game.legal_actions(),
                network.initial_inference(current_observation))
    add_exploration_noise(config, root)

    # We then run a Monte Carlo Tree Search using only action sequences and the
    # model learned by the network.
    run_mcts(config, root, game.action_history(), network)
    action = select_action(config, len(game.history), root, network)
    game.apply(action)
    game.store_search_statistics(root)
  return game

对我来说，似乎每次采取行动都会产生奖励。所以self.rewards列表中的第一个奖励应该是在游戏中采取第一个动作的奖励。

如果 current_index = 0 在 self.rewards[current_index] 中，问题就变得很清楚了。在这种情况下，predictions 列表的第一个奖励将是 0，因为它总是如此。但是，targets 列表将获得完成第一个动作的奖励。

所以，对我来说，奖励似乎错位了。

如果我们继续，predictions 列表中的第二个奖励将是 recurrent_inference 完成 第一个 行动的奖励。然而，targets列表中的第二个奖励将是游戏中存储的完成第二个动作的奖励。

总的来说，我有三个相互依存的问题：

initial_inference的奖励代表什么？（这是什么？）
如果是0，应该代表奖励，那么predictions和targets之间的奖励是不是错位了？（即 predictions 中的第二个奖励实际上应该与 targets 中的第一个奖励相匹配吗？）
如果它们未对齐，网络是否仍能正常训练和工作？

（另一个需要注意的好奇心是，尽管存在这种错位（假设存在错位），predictions 和 targets 长度确实具有相同的长度。目标长度由行定义上面make_target函数中的for current_index in range(state_index, state_index + num_unroll_steps + 1)，上面我们也计算出predictions的长度是len(actions) + 1，而len(actions)是由g.history[i:i + num_unroll_steps]定义的在 sample_batch 函数中（参见 the pseudocode）。因此，两个列表的长度相同。）

怎么回事？

Answer 1

作者在这里

What does the reward from the initial_inference represent?

初始推理"predicts"最后观察到的奖励。这实际上并没有用于任何事情，但使我们的代码更简单：预测头总是可以简单地预测紧接在前的奖励。对于动态网络，这将是在应用作为动态网络输入给出的操作后观察到的奖励。

游戏开始时没有最后观察到的奖励，所以我们直接设置为0。

伪代码中的奖励目标计算确实错位了；我刚刚上传了一个新版本到 arXiv。

过去常说的地方

      if current_index < len(self.root_values):
        targets.append((value, self.rewards[current_index],
                        self.child_visits[current_index]))
      else:
        # States past the end of games are treated as absorbing states.
        targets.append((0, 0, []))

应该是：

      # For simplicity the network always predicts the most recently received
      # reward, even for the initial representation network where we already
      # know this reward.
      if current_index > 0 and current_index <= len(self.rewards):
        last_reward = self.rewards[current_index - 1]
      else:
        last_reward = 0

      if current_index < len(self.root_values):
        targets.append((value, last_reward, self.child_visits[current_index]))
      else:
        # States past the end of games are treated as absorbing states.
        targets.append((0, last_reward, []))

希望对您有所帮助！

MuZero 的伪代码中的奖励值是否未对齐？

Is the reward value in MuZero's pseudocode misaligned?

python

algorithm

artificial-intelligence

structure

machine-learning