Some weights of Actor Critic model not updating

I am working on an Actor-Critic model in PyTorch. The model first passes the input through an RNN, and then the policy network takes over. The code of the policy network is:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """
    implements both actor and critic in one model
    """
    def __init__(self):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(state_size + 1, 128)
        self.fc2 = nn.Linear(128, 64)

        # actor's layer
        self.action_head = nn.Linear(64, action_size)
        self.mu = nn.Sigmoid()
        self.var = nn.Softplus()

        # critic's layer
        self.value_head = nn.Linear(64, 1)

    def forward(self, x):
        """
        forward of both actor and critic
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # actor: chooses the action to take from state s_t
        # by returning the parameters of the action distribution
        action_prob = self.action_head(x)
        mu = self.mu(action_prob)
        var = self.var(action_prob)

        # critic: evaluates being in the state s_t
        state_values = self.value_head(x)

        return mu, var, state_values


policy = Policy()
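
For completeness, a quick shape check of this network could look like the sketch below; state_size and action_size are not given in the question, so the values here are placeholders only:

import torch

state_size = 10    # placeholder value, not from the question
action_size = 4    # placeholder value, not from the question

policy = Policy()
dummy_input = torch.randn(1, state_size + 1)   # fc1 expects state_size + 1 features
mu, var, state_value = policy(dummy_input)
print(mu.shape, var.shape, state_value.shape)  # (1, action_size), (1, action_size), (1, 1)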

Inside the model class, this Policy is called after the RNN. In the act method of the agent class, we call the model to get an action like this:

def act(self, some_input, state):
    mu, var, state_value = self.model(some_input, state)
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    action = np.random.normal(mu, sigma)
    action = np.clip(action, 0, 1)
    action = torch.from_numpy(action/1000)
    return action, state_value

I should mention that we pass model.parameters() to the optimizer. When we print all trainable parameters at every epoch, we see that everything changes except policy.action_head. Any idea why this happens? I should also mention how the loss is currently computed:

    advantage = reward - Value
    Lp = -math.log(pdf_prob_now) * advantage
    policy_losses.append(Lp)
    # similar for value_losses

# after all the runs in the epoch are done
loss = torch.stack(policy_losses).sum() + alpha * torch.stack(value_losses).sum()
loss.backward()

Here Value is the state_value (the second output of agent.act), and pdf_prob_now is the action probability over all possible actions, computed as follows:

import torch
from scipy import stats


def find_pdf(policy, action, rnn_output):
    mu, var, _ = policy(rnn_output)
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    pdf_probability = stats.norm.pdf(action.cpu(), loc=mu, scale=sigma)
    return pdf_probability
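
As a side note, one way to verify which parameters are actually reached by backpropagation is to inspect their gradients right after loss.backward(); a small diagnostic sketch, using the policy instance from above:

# run this right after loss.backward()
for name, param in policy.named_parameters():
    has_grad = param.grad is not None and param.grad.abs().sum().item() > 0
    print(f"{name}: receives gradient = {has_grad}")
# if action_head.weight never receives a gradient, nothing in the loss is
# connected to the actor's output through the computation graph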

Is there a logic error here?

The error is in the act function:

def act(self, some_input, state):
    # mu contains the info required for the gradient
    mu, var, state_value = self.model(some_input, state)
    # mu is detached here and has now forgotten all the operations
    # performed in self.action_head
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    action = np.random.normal(mu, sigma)
    action = np.clip(action, 0, 1)
    action = torch.from_numpy(action/1000)
    return action, state_value

In the subsequent steps, even if the loss is computed from tensor operations performed on action, it cannot be traced back to update the weights of self.action_head, because you detached the tensor mu and thereby removed it from the computation graph. That is why you see no updates in self.action_head.
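
One common way to keep the actor inside the graph is to sample through torch.distributions.Normal and keep the log-probability as a tensor, instead of detaching mu/var and going through NumPy/SciPy. A minimal sketch of that idea (the clipping and /1000 scaling are kept only as placeholders for the question's post-processing):

import torch
from torch.distributions import Normal

def act(self, some_input, state):
    # keep mu and var as tensors so the path back to self.action_head survives
    mu, var, state_value = self.model(some_input, state)
    dist = Normal(mu, torch.sqrt(var))
    action = dist.sample()                      # sampling itself needs no gradient
    log_prob = dist.log_prob(action).sum()      # tensor, still connected to action_head
    action = torch.clamp(action, 0, 1) / 1000   # placeholder post-processing
    return action, log_prob, state_value

The policy loss can then be built directly from the stored log_prob tensors, for example Lp = -log_prob * advantage.detach(), so that loss.backward() reaches self.action_head.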