Why does the loss go up?

Running the code below sometimes causes the loss to go up during training and then stay there. Why is that?

import tensorflow as tf
from tensorflow.keras import layers, losses, models

FEATURE_COUNT = 2
TRAINING_SET_SIZE = 128


def patch_nans(t: tf.Tensor) -> tf.Tensor:
    """:return t with nans replaced by zeros"""
    nan_mask = tf.math.is_nan(t)
    return tf.where(nan_mask, tf.zeros_like(t), t)


def check_numerics(t: tf.Tensor) -> tf.Tensor:
    """Throw an exception if t contains nans."""
    return tf.debugging.check_numerics(t, "t")


def get_model() -> models.Model:
    inp = layers.Input(shape=[FEATURE_COUNT])
    mid = layers.Dense(units=64)(inp)
    mid = layers.ReLU()(mid)
    mid = layers.Dense(units=1)(mid)
    mid = layers.Lambda(patch_nans)(mid)
    out = layers.Lambda(check_numerics)(mid)
    return models.Model(inp, out)


model = get_model()
model.compile(
    optimizer=tf.optimizers.SGD(),
    loss=losses.mean_squared_error
)
model.summary()

features = tf.random.normal(shape=[TRAINING_SET_SIZE, FEATURE_COUNT])
features_with_nans = tf.maximum(tf.math.log(features + 1), tf.zeros_like(features))
labels = tf.random.normal(shape=[TRAINING_SET_SIZE, 1])

# Evaluate the model before training
model.evaluate(features_with_nans, labels, batch_size=8)

# Evaluate the model while training
model.fit(features_with_nans, labels, batch_size=8, epochs=4)

The model is a simple two-layer model, the loss is MSE, and the training set contains no extreme values whatsoever (apart from the nans).

Excerpt from a run where the loss increases:

  8/128 [>.............................] - ETA: 0s - loss: 0.4720
128/128 [==============================] - 0s 593us/sample - loss: 1.1050
Train on 128 samples
Epoch 1/4

  8/128 [>.............................] - ETA: 3s - loss: 2.3937
128/128 [==============================] - 0s 2ms/sample - loss: 1.1096
Epoch 2/4

  8/128 [>.............................] - ETA: 0s - loss: 1.1668
128/128 [==============================] - 0s 141us/sample - loss: 1.1202
Epoch 3/4

  8/128 [>.............................] - ETA: 0s - loss: 1.0059
128/128 [==============================] - 0s 141us/sample - loss: 1.1202
Epoch 4/4

  8/128 [>.............................] - ETA: 0s - loss: 1.6480
128/128 [==============================] - 0s 156us/sample - loss: 1.1202

Once there is a nan in your model, you will get nans in the gradients; that is unavoidable.

Once you have nans in the gradients and they are added up, you will get nans in all of the model's weights.

And once the model's weights are nan, nothing more can be done with that model.

Check for yourself with print(model.get_weights()) after training.
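
For example, something along these lines (a quick check of my own, assuming numpy is available as np and model is the trained model from the question):

import numpy as np

# after fit(), every weight array is expected to be full of nans
for w in model.get_weights():
    print(w.shape, "all nan:", np.isnan(w).all())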


The loss goes up because the model suddenly starts outputting only zeros (the weights are all nan, so the patched output is all zeros), and from the second pass on it never changes again: the constant loss is just the mean of the squared labels, since every prediction is zero.


Why?

Yes, I know this sounds strange, because you replace the nans before the loss is computed, but some internal machinery in tensorflow still sees them - most likely it is still applying the chain rule, and it does not understand that when there is a zero it could simply skip all of the preceding layers - after all, it is a computer, and zero * nan = nan.
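
A minimal sketch of that effect (my own illustration, not the code from the question), reduced to a single weight: the forward value is patched to zero, but the chain rule still multiplies that zero by the nan input:

import tensorflow as tf

w = tf.Variable([1.0])
x = tf.constant([float("nan")])  # a nan feature, as in the question

with tf.GradientTape() as tape:
    y = w * x                                              # forward value: nan
    y = tf.where(tf.math.is_nan(y), tf.zeros_like(y), y)  # patched to 0
    loss = tf.reduce_sum(tf.square(y))                     # loss is a clean 0.0

print(tape.gradient(loss, w))  # tf.Tensor([nan], ...) because 0 * nan = nan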

A solution?

If you really want to use nans (even though it does not sound like a good idea), you have to remove them at the very beginning.

Here is a proposal: you remove the nans at the very beginning, then you use the same nan mask to zero the final outputs, and you also set the labels to zero wherever there were nans. That way your loss behaves well:

import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras import layers, losses, models

#FEATURE_COUNT and TRAINING_SET_SIZE are reused from the question's snippet

#uses a given nan mask to zero the outputs at specified places
def removeNan(x):
    t, nan_mask = x
    return tf.where(nan_mask, tf.zeros_like(t), t)


#a changed model that removes the nans at the very beginning
#later this model uses the same nan mask to zero the outputs
def get_model2() -> models.Model:
    inp = layers.Input(shape=[FEATURE_COUNT])

    #remove the nans before anything!!!! Keep the mask for applying to the outputs
    nanMask = layers.Lambda(lambda x: tf.math.is_nan(x))(inp)
    mid = layers.Lambda(removeNan)([inp, nanMask])

    mid = layers.Dense(units=64)(mid)
    mid = layers.ReLU()(mid)
    mid = layers.Dense(units=1)(mid)
    
    #apply the mask again, just to have consistent results
    #(collapse it to one flag per sample first, so it matches the [batch, 1] output)
    anyNan = layers.Lambda(
        lambda m: tf.reduce_any(m, axis=-1, keepdims=True))(nanMask)
    out = layers.Lambda(removeNan)([mid, anyNan])
    return models.Model(inp, out)


#your features and labels
features = tf.random.normal(shape=[TRAINING_SET_SIZE, FEATURE_COUNT])
features_with_nans = tf.maximum(tf.math.log(features + 1), tf.zeros_like(features))
labels = tf.random.normal(shape=[TRAINING_SET_SIZE, 1])


#remember to zero the labels too, so you get a more trustworthy loss value:
#K.sum carries any feature nan into the row, and 0 * nan is still nan
feature_nans = 0*K.sum(features_with_nans, axis=-1, keepdims=True)
labels_with_nans = labels + feature_nans
labels_with_nans = K.switch(tf.math.is_nan(labels_with_nans), 
                            K.zeros_like(labels_with_nans), 
                            labels_with_nans)

#build new model
model = get_model2()
model.compile(
    optimizer=tf.optimizers.SGD(),
    loss=losses.mean_squared_error
)
model.summary()

#fit and check weights
model.fit(features_with_nans, labels_with_nans, batch_size=10, epochs=5)
print(model.get_weights())
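
A quick way to confirm the trick worked (my own check, assuming numpy as np): after fitting, none of the weight arrays should contain a single nan anymore:

import numpy as np

assert not any(np.isnan(w).any() for w in model.get_weights())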

Caution (must check): I have read somewhere that on a GPU or TPU the nans might be replaced internally with zeros to make it possible to use the hardware.

If that is true, you should definitely use something other than nan, for example a value such as -10000, used as the mask marker in the approach I suggested.
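
If you go that route, here is a minimal sketch of the sentinel idea (my own illustration; SENTINEL and features_masked are names I made up, and features_with_nans is the tensor from above):

import tensorflow as tf

SENTINEL = -10000.0

# preprocessing: swap the nans for the sentinel before the data reaches the model
features_masked = tf.where(tf.math.is_nan(features_with_nans),
                           tf.fill(tf.shape(features_with_nans), SENTINEL),
                           features_with_nans)

# inside the model, the mask becomes an ordinary comparison instead of is_nan,
# e.g. nanMask = layers.Lambda(lambda x: tf.equal(x, SENTINEL))(inp)
nan_mask = tf.equal(features_masked, SENTINEL)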