TensorFlow 2.0: ValueError - No Gradients Provided (After Modifying DDPG Actor)

Background

I am currently trying to implement a DDPG framework to control a simple car agent. At first, the car agent only needed to learn how to reach the end of a straight path as quickly as possible by adjusting its acceleration. That task was simple, so I decided to also introduce an additional steering action. I updated my observation and action spaces accordingly.
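For context, the two actions are acceleration and steering. Below is a minimal sketch of what such a two-dimensional action space might look like; the gym.spaces.Box layout is an assumption on my part (the environment code is not shown here), and the bounds are the ones discussed later in this post:

import math
import numpy as np
from gym import spaces

# hypothetical two-dimensional action space: [acceleration, steering]
# acceleration in [-3.5, 3.5], steering in [-30 deg, +30 deg] (radians)
action_space = spaces.Box(
    low=np.array([-3.5, -math.radians(30)], dtype=np.float32),
    high=np.array([3.5, math.radians(30)], dtype=np.float32),
    dtype=np.float32,
)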

The following lines are the for loop that runs each episode:

for i in range(episodes):
    observation = env.reset()
    done = False
    score = 0
    while not done:
        action = agent.choose_action(observation, evaluate)
        observation_, reward, done, info = env.step(action)
        score += reward
        agent.remember(observation, action, reward, observation_, done)
        if not load_checkpoint:
            agent.learn()
        observation = observation_

The following lines are my choose_action and learn functions:

def choose_action(self, observation, evaluate=False):
    state = tf.convert_to_tensor([observation], dtype=tf.float32)
    actions = self.actor(state)
    if not evaluate:
        actions += tf.random.normal(shape=[self.n_actions],
                mean=0.0, stddev=self.noise)
    actions = tf.clip_by_value(actions, self.min_action, self.max_action)

    return actions[0]

def learn(self):
    if self.memory.mem_cntr < self.batch_size:
        return

    state, action, reward, new_state, done = \
            self.memory.sample_buffer(self.batch_size)

    states = tf.convert_to_tensor(state, dtype=tf.float32)
    states_ = tf.convert_to_tensor(new_state, dtype=tf.float32)
    rewards = tf.convert_to_tensor(reward, dtype=tf.float32)
    actions = tf.convert_to_tensor(action, dtype=tf.float32)

    with tf.GradientTape() as tape:
        # bootstrapped TD target from the target networks
        target_actions = self.target_actor(states_)
        critic_value_ = tf.squeeze(self.target_critic(
                            states_, target_actions), 1)
        critic_value = tf.squeeze(self.critic(states, actions), 1)
        target = rewards + self.gamma*critic_value_*(1-done)
        critic_loss = keras.losses.MSE(target, critic_value)

    critic_network_gradient = tape.gradient(critic_loss,
                                        self.critic.trainable_variables)
    self.critic.optimizer.apply_gradients(zip(
        critic_network_gradient, self.critic.trainable_variables))

    with tf.GradientTape() as tape:
        new_policy_actions = self.actor(states)
        actor_loss = -self.critic(states, new_policy_actions)
        actor_loss = tf.math.reduce_mean(actor_loss)

    actor_network_gradient = tape.gradient(actor_loss, 
                                self.actor.trainable_variables)
    self.actor.optimizer.apply_gradients(zip(
        actor_network_gradient, self.actor.trainable_variables))

    self.update_network_parameters()
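One side note on choose_action above (not part of the fix, just a sketch): if self.min_action and self.max_action are scalars, both actions get clipped to the same bounds. If the two actions are meant to have different ranges, tf.clip_by_value also accepts per-element bounds that broadcast against the action tensor, for example:

import math
import tensorflow as tf

# hypothetical per-action bounds: [acceleration, steering (radians)]
min_action = tf.constant([-3.5, -math.radians(30)], dtype=tf.float32)
max_action = tf.constant([3.5, math.radians(30)], dtype=tf.float32)

actions = tf.constant([[4.2, 0.9]], dtype=tf.float32)  # e.g. actor output plus noise
clipped = tf.clip_by_value(actions, min_action, max_action)
print(clipped.numpy())  # approximately [[3.5, 0.5236]]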

Finally, my ActorNetwork is as follows:

class ActorNetwork(keras.Model):
    def __init__(self, fc1_dims=512, fc2_dims=512, n_actions=2, name='actor',
            chkpt_dir='tmp/ddpg'):
        super(ActorNetwork, self).__init__()
        self.fc1_dims = fc1_dims
        self.fc2_dims = fc2_dims
        self.n_actions = n_actions

        self.model_name = name
        self.checkpoint_dir = chkpt_dir
        self.checkpoint_file = os.path.join(self.checkpoint_dir, 
                    self.model_name+'_ddpg.h5')

        self.fc1 = Dense(self.fc1_dims, activation='relu')
        self.fc2 = Dense(self.fc2_dims, activation='relu')
        self.mu = Dense(self.n_actions, activation='tanh')

    def call(self, state):
        prob = self.fc1(state)
        prob = self.fc2(prob)

        mu = self.mu(prob) * 3.5 

        return mu

Note: the code I am using is essentially built from this tutorial.

The Problem

So far I have not had any problems with the code, but I do want to adjust the maximum/minimum values of my actions. When I only considered the acceleration action, I simply multiplied mu by 3.5. However, I want the steering action to lie in a range of -30 to 30 degrees, and I cannot just multiply mu as I did before. To try to adjust the desired steering range, I made the following (not very elegant) changes to the ActorNetwork:

def call(self, state):
    prob = self.fc1(state)
    prob = self.fc2(prob)

    mu = self.mu(prob)# * 3.5
    mu_ = []
    mu_l = mu.numpy().tolist()
    
    for i, elem1 in enumerate(mu_l):
        temp_ = []
        for j, elem2 in enumerate(elem1):
            if j-1 == 0:
                temp_.append(float(elem2 * 3.5))
            else:
                temp_.append(float(elem2 * math.radians(30)))
        mu_.append(temp_)
        
    mu = tf.convert_to_tensor(mu_, dtype=tf.float32)
    
    return mu

The new lines I added are meant to:

  1. Convert the mu tensor to a list
  2. Iterate over the elements of the mu list (mu_l): if a value is at index 0 (acceleration), multiply it by 3.5; otherwise multiply the value at index 1 (steering) by 30 degrees converted to radians
  3. Append each adjusted value to a new list (mu_)
  4. Set mu equal to the tensor conversion of mu_

It was at this point that I ran into the following error:

ValueError: No gradients provided for any variable: ['actor_network/dense/kernel:0', 'actor_network/dense/bias:0', 'actor_network/dense_1/kernel:0', 'actor_network/dense_1/bias:0', 'actor_network/dense_2/kernel:0', 'actor_network/dense_2/bias:0'].

I have tried the solutions offered on Stack Overflow and in external resources (e.g., using watch, checking to make sure I call model() rather than model.predict() inside GradientTape(), and making sure I am not performing calculations outside of the tape context), but I have had no luck resolving this. I suspect my problem is similar to the one raised in this question, but I am not sure how to diagnose whether my problem also stems from overwriting mu with a tensor. Does anyone have any insight into this issue?
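For anyone hitting the same thing, here is a small stand-alone sketch (not taken from my agent code) of the failure mode I suspected: once a value is pulled out of the graph with .numpy() and rebuilt via tf.convert_to_tensor, the tape can no longer trace it back to the layer weights, so every gradient comes back as None, and apply_gradients then raises the "No gradients provided" error.

import tensorflow as tf

x = tf.constant([[1.0, 2.0]])
layer = tf.keras.layers.Dense(2)

with tf.GradientTape() as tape:
    y = layer(x)
    # round-tripping through numpy/Python lists severs the tape's record
    y_detached = tf.convert_to_tensor(y.numpy().tolist(), dtype=tf.float32)
    loss = tf.reduce_mean(y_detached)

print(tape.gradient(loss, layer.trainable_variables))  # [None, None]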

The problem has been resolved thanks to some simple but helpful advice I received on Reddit. I was breaking the tracking of my variables by making the changes with my custom for loop. I should have used TensorFlow functions instead. The following change solved the problem for me:

def call(self, state):
    prob = self.fc1(state)
    prob = self.fc2(prob)

    mu = self.mu(prob)
    # scale with TensorFlow ops so the tape keeps tracking the graph:
    # acceleration by 3.5, steering by 30 degrees expressed in radians
    mult = tf.convert_to_tensor([3.5, math.radians(30)], dtype=tf.float32)
    mu = tf.math.multiply(mu, mult)

    return mu
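As a quick sanity check (just a sketch; the dummy batch below assumes an 8-dimensional observation, which is not necessarily what my environment uses), the gradients now flow through the rescaled output:

import tensorflow as tf

actor = ActorNetwork()
dummy_states = tf.random.normal([4, 8])  # hypothetical batch of 8-dim observations

with tf.GradientTape() as tape:
    mu = actor(dummy_states)
    loss = tf.reduce_mean(mu)

grads = tape.gradient(loss, actor.trainable_variables)
print(all(g is not None for g in grads))  # True -- every weight gets a gradient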