Some weights of Actor Critic model not updating
I am working on an Actor-Critic model in PyTorch. The input first goes through an RNN, and then the policy network takes over. The code of the policy network is:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Policy(nn.Module):
    """
    implements both actor and critic in one model
    """
    def __init__(self):
        super(Policy, self).__init__()
        # state_size and action_size are defined elsewhere
        self.fc1 = nn.Linear(state_size + 1, 128)
        self.fc2 = nn.Linear(128, 64)

        # actor's layer
        self.action_head = nn.Linear(64, action_size)
        self.mu = nn.Sigmoid()
        self.var = nn.Softplus()

        # critic's layer
        self.value_head = nn.Linear(64, 1)

    def forward(self, x):
        """
        forward of both actor and critic
        """
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # actor: chooses the action to take from state s_t
        # by returning the parameters (mu, var) of the action distribution
        action_prob = self.action_head(x)
        mu = self.mu(action_prob)
        var = self.var(action_prob)

        # critic: evaluates being in the state s_t
        state_values = self.value_head(x)

        return mu, var, state_values

policy = Policy()
In the model class, this policy is called after the RNN. In the agent class's act method, we call the model to get the action like this:
def act(self, some_input, state):
    mu, var, state_value = self.model(some_input, state)
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    action = np.random.normal(mu, sigma)
    action = np.clip(action, 0, 1)
    action = torch.from_numpy(action/1000)
    return action, state_value
I must mention that the optimizer is given model.parameters(). When we print all the trainable parameters at every epoch, we see that everything changes except policy.action_head (a sketch of this check is included below, after the find_pdf snippet). Any idea why this happens? I must also mention how the loss is currently computed:
advantage = reward - Value
Lp = -math.log(pdf_prob_now)*advantage
policy_losses.append(Lp)
#similar for value_losses
#after all the runs in the epoch is done
loss = torch.stack(policy_losses).sum() + alpha*torch.stack(value_losses).sum()
loss.backward()
Here Value is the state_value (the second output of agent.act), and pdf_prob_now is the action probability for all possible actions, computed as follows:
def find_pdf(policy, action, rnn_output):
    mu, var, _ = policy(rnn_output)
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    # evaluate the Gaussian pdf of the taken action with scipy.stats
    pdf_probability = stats.norm.pdf(action.cpu(), loc=mu, scale=sigma)
    return pdf_probability
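For completeness, the per-epoch parameter check mentioned above boils down to something like this (a minimal sketch rather than the exact logging code; it also prints each parameter's gradient norm, which is available right after loss.backward()):

for name, param in policy.named_parameters():
    # a parameter whose .grad is None (or all zeros) after backward()
    # is never touched by the optimizer step
    grad_norm = None if param.grad is None else param.grad.norm().item()
    print(name, param.data.norm().item(), grad_norm)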
Is there a logical error here?
The error is in the act function:
def act(self, some_input, state):
    # mu carries the information required for the gradient
    mu, var, state_value = self.model(some_input, state)
    # mu is detached here and has now "forgotten" all the operations
    # performed in self.action_head
    mu = mu.data.cpu().numpy()
    sigma = torch.sqrt(var).data.cpu().numpy()
    action = np.random.normal(mu, sigma)
    action = np.clip(action, 0, 1)
    action = torch.from_numpy(action/1000)
    return action, state_value
For the rest of the process, even if the loss is computed from tensor operations performed on action, it cannot be traced back to update the weights of self.action_head, because you detached the tensor mu and thereby removed it from the computation graph. That is why you see no updates in self.action_head.
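One way to keep self.action_head inside the graph (a sketch under the assumption that sampling and the log-probability are done with torch.distributions instead of numpy/scipy; not the original code) would be:

def act(self, some_input, state):
    mu, var, state_value = self.model(some_input, state)

    # build the distribution from the attached tensors: no .data / .cpu().numpy()
    dist = torch.distributions.Normal(mu, torch.sqrt(var))

    # sampling still yields a concrete action for the environment
    # (the action/1000 rescaling from the question is omitted here)
    action = torch.clamp(dist.sample(), 0, 1)

    # log_prob(action) is differentiable w.r.t. mu and var, so a loss of the
    # form -log_prob * advantage backpropagates into self.action_head
    log_prob = dist.log_prob(action)

    return action, log_prob, state_value

The policy loss would then use this log_prob directly (e.g. Lp = -log_prob * advantage) instead of math.log(pdf_prob_now), so the chain back through self.action_head stays intact.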