element 0 of tensors does not require grad and does not have a grad_fn
I am trying to apply a reinforcement-learning approach to a classification task. I know this is not particularly useful, since deep learning will likely outperform RL on such a task, but I am doing it as research anyway.
I reward the agent with +1 if it classifies correctly and -1 if it does not, and I compute the loss function from the predicted action (the predicted class) and the reward.
However, I get this error:
element 0 of tensors does not require grad and does not have a grad_fn
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque

# creating model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.pipe = nn.Sequential(nn.Linear(9, 120),
                                  nn.ReLU(),
                                  nn.Linear(120, 64),
                                  nn.ReLU(),
                                  nn.Linear(64, 2),
                                  nn.Softmax(dim=1)
                                  )

    def forward(self, x):
        return self.pipe(x)


def env_step(action, label, size):
    # reward +1 for a correct prediction, -1 otherwise
    total_reward = []
    for i in range(size):
        reward = 0
        if action[i] == label[i]:
            total_reward.append(reward + 1)
        else:
            total_reward.append(reward - 1)
    return total_reward


if __name__ == '__main__':
    epoch_size = 100
    net = Net()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(params=net.parameters(), lr=0.01)
    total_loss = deque(maxlen=50)

    for epoch in range(epoch_size):
        batch_index = 0
        for i in range(13):
            # batch sample (train_state / train_label are the dataset, defined elsewhere)
            batch_xs = torch.FloatTensor(train_state[batch_index: batch_index + 50])
            batch_ys = torch.from_numpy(train_label[batch_index: batch_index + 50]).type('torch.LongTensor')

            # action_prob; e.g. classification prob
            actions_prob = net(batch_xs)
            action = torch.argmax(actions_prob, dim=1).unsqueeze(1)

            reward = np.array(env_step(action, batch_ys, 50))
            reward = torch.from_numpy(reward).unsqueeze(1).type('torch.FloatTensor')
            action = action.type('torch.FloatTensor')

            optimizer.zero_grad()
            loss = criterion(action, reward)
            loss.backward()  # RuntimeError: element 0 of tensors does not require grad ...
            optimizer.step()

            batch_index += 50
action is produced by the argmax function, which is not differentiable. Instead, you want to take the loss between the reward and the responsible probability of the action that was taken.
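For illustration, here is a minimal, self-contained reproduction of that failure mode (not the poster's code): argmax returns an integer tensor that is detached from the autograd graph, so any loss built from it has no grad_fn and backward() raises exactly this error.

import torch
import torch.nn as nn

probs = nn.Softmax(dim=1)(torch.randn(4, 2, requires_grad=True))
print(probs.requires_grad, probs.grad_fn)    # True, <SoftmaxBackward...>

action = torch.argmax(probs, dim=1)          # integer indices: the graph is cut here
print(action.requires_grad, action.grad_fn)  # False, None

loss = nn.MSELoss()(action.float(), torch.ones(4))
loss.backward()  # RuntimeError: element 0 of tensors does not require grad ...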
Usually, the "loss" chosen for a policy in reinforcement learning is the so-called score function: the product of the log of the responsible probability of the taken action a and the reward obtained. Written as a loss to minimize, that is loss = -log π(a|s) · R.
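As a rough sketch of that idea (the helper name policy_loss and the use of torch.multinomial to sample an action instead of taking argmax are my own choices here, not something from the question): take the log-probability of the action actually taken, multiply it by the reward, and negate it so that minimizing the loss maximizes expected reward. Gradients then flow through the network's softmax output.

import torch

def policy_loss(actions_prob, action, reward):
    # score function / REINFORCE-style loss: -E[log pi(a|s) * R]
    log_prob = torch.log(actions_prob.gather(1, action.unsqueeze(1)).squeeze(1))
    return -(log_prob * reward).mean()

# toy check with random data, mirroring the question's shapes (batch of 50, 2 classes)
probs = torch.softmax(torch.randn(50, 2, requires_grad=True), dim=1)
action = torch.multinomial(probs, 1).squeeze(1)   # sample one action per row
labels = torch.randint(0, 2, (50,))
reward = (action == labels).float() * 2 - 1       # +1 if correct, -1 otherwise
loss = policy_loss(probs, action, reward)
loss.backward()                                   # gradients flow now

In the training loop this would replace criterion(action, reward): keep actions_prob (the network output) in the loss instead of the argmax result, so loss.backward() has a path back to the weights.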