Some weights of Actor Critic model not updating

I am working on an Actor-Critic model in PyTorch. The model first passes the input through an RNN, and then the policy network takes over. The code of the policy network is:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """
    implements both actor and critic in one model
    """
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(state_size + 1, 128)
        self.fc2 = nn.Linear(128, 64)

        # actor's layer
        self.action_head = nn.Linear(64, action_size)
        self.mu = nn.Sigmoid()
        self.var = nn.Softplus()

        # critic's layer
        self.value_head = nn.Linear(64, 1)

    def forward(self, x):
        """
        forward of both actor and critic
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # actor: chooses the action to take from state s_t
        # by returning the parameters of the action distribution
        action_prob = self.action_head(x)
        mu = self.mu(action_prob)
        var = self.var(action_prob)

        # critic: evaluates being in the state s_t
        state_values = self.value_head(x)

        return mu, var, state_values


policy = Policy()
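
For completeness, a quick shape check of this network could look like the sketch below; state_size and action_size are not given in the question, so the values here are placeholders only:

import torch

state_size = 10    # placeholder value, not from the question
action_size = 4    # placeholder value, not from the question

policy = Policy()
dummy_input = torch.randn(1, state_size + 1)   # fc1 expects state_size + 1 features
mu, var, state_value = policy(dummy_input)
print(mu.shape, var.shape, state_value.shape)  # (1, action_size), (1, action_size), (1, 1)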

Inside the model class, this Policy is called after the RNN. In the act method of the agent class, we call the model to get an action like this:

def act(self, some_input, state):
    mu, var, state_value = self.model(some_input, state)
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    action = np.random.normal(mu, sigma)
    action = np.clip(action, 0, 1)
    action = torch.from_numpy(action/1000)
    return action, state_value

I should mention that we pass model.parameters() to the optimizer. When we print all trainable parameters at every epoch, we see that everything changes except policy.action_head. Any idea why this happens? I should also mention how the loss is currently computed:

    advantage = reward - Value
    Lp = -math.log(pdf_prob_now) * advantage
    policy_losses.append(Lp)
    # similar for value_losses

# after all the runs in the epoch are done
loss = torch.stack(policy_losses).sum() + alpha * torch.stack(value_losses).sum()
loss.backward()

Here Value is the state_value (the second output of agent.act), and pdf_prob_now is the action probability over all possible actions, computed as follows:

import torch
from scipy import stats


def find_pdf(policy, action, rnn_output):
    mu, var, _ = policy(rnn_output)
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    pdf_probability = stats.norm.pdf(action.cpu(), loc=mu, scale=sigma)
    return pdf_probability
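
As a side note, one way to verify which parameters are actually reached by backpropagation is to inspect their gradients right after loss.backward(); a small diagnostic sketch, using the policy instance from above:

# run this right after loss.backward()
for name, param in policy.named_parameters():
    has_grad = param.grad is not None and param.grad.abs().sum().item() > 0
    print(f"{name}: receives gradient = {has_grad}")
# if action_head.weight never receives a gradient, nothing in the loss is
# connected to the actor's output through the computation graph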

Is there a logic error here?

The error is in the act function:

def act(self, some_input, state):
    # mu contains the info required for the gradient
    mu, var, state_value = self.model(some_input, state)
    # mu is detached here and has now forgotten all the operations
    # performed in self.action_head
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    action = np.random.normal(mu, sigma)
    action = np.clip(action, 0, 1)
    action = torch.from_numpy(action/1000)
    return action, state_value

In the subsequent steps, even if the loss is computed from tensor operations performed on action, it cannot be traced back to update the weights of self.action_head, because you detached the tensor mu and thereby removed it from the computation graph. That is why you see no updates in self.action_head.
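
One common way to keep the actor inside the graph is to sample through torch.distributions.Normal and keep the log-probability as a tensor, instead of detaching mu/var and going through NumPy/SciPy. A minimal sketch of that idea (the clipping and /1000 scaling are kept only as placeholders for the question's post-processing):

import torch
from torch.distributions import Normal

def act(self, some_input, state):
    # keep mu and var as tensors so the path back to self.action_head survives
    mu, var, state_value = self.model(some_input, state)
    dist = Normal(mu, torch.sqrt(var))
    action = dist.sample()                      # sampling itself needs no gradient
    log_prob = dist.log_prob(action).sum()      # tensor, still connected to action_head
    action = torch.clamp(action, 0, 1) / 1000   # placeholder post-processing
    return action, log_prob, state_value

The policy loss can then be built directly from the stored log_prob tensors, for example Lp = -log_prob * advantage.detach(), so that loss.backward() reaches self.action_head.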