Why does the loss go up?
Running the code below sometimes causes the loss to go up during training and then get stuck there. Why is that?
import tensorflow as tf
from tensorflow.keras import layers, losses, models

FEATURE_COUNT = 2
TRAINING_SET_SIZE = 128


def patch_nans(t: tf.Tensor) -> tf.Tensor:
    """:return: t with nans replaced by zeros"""
    nan_mask = tf.math.is_nan(t)
    return tf.where(nan_mask, tf.zeros_like(t), t)


def check_numerics(t: tf.Tensor) -> tf.Tensor:
    """Throw an exception if t contains nans."""
    return tf.debugging.check_numerics(t, "t")


def get_model() -> models.Model:
    inp = layers.Input(shape=[FEATURE_COUNT])
    mid = layers.Dense(units=64)(inp)
    mid = layers.ReLU()(mid)
    mid = layers.Dense(units=1)(mid)
    mid = layers.Lambda(patch_nans)(mid)
    out = layers.Lambda(check_numerics)(mid)
    return models.Model(inp, out)


model = get_model()
model.compile(
    optimizer=tf.optimizers.SGD(),
    loss=losses.mean_squared_error
)
model.summary()

features = tf.random.normal(shape=[TRAINING_SET_SIZE, FEATURE_COUNT])
features_with_nans = tf.maximum(tf.math.log(features + 1), tf.zeros_like(features))
labels = tf.random.normal(shape=[TRAINING_SET_SIZE, 1])

# Evaluate the model before training
model.evaluate(features_with_nans, labels, batch_size=8)
# Evaluate the model while training
model.fit(features_with_nans, labels, batch_size=8, epochs=4)
The model is a simple two-layer sequential model with an MSE loss, and the training set contains no extreme values apart from the nans.
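The nans come from tf.math.log(features + 1), which is nan wherever features < -1 (tf.maximum then propagates them). A quick sanity check of the nan fraction, along these lines:

# For standard-normal features, each entry is nan with probability
# P(features < -1), roughly 0.16.
nan_fraction = tf.reduce_mean(
    tf.cast(tf.math.is_nan(features_with_nans), tf.float32))
print(float(nan_fraction))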
Excerpt from a run where the loss increases:
8/128 [>.............................] - ETA: 0s - loss: 0.4720
128/128 [==============================] - 0s 593us/sample - loss: 1.1050
Train on 128 samples
Epoch 1/4
8/128 [>.............................] - ETA: 3s - loss: 2.3937
128/128 [==============================] - 0s 2ms/sample - loss: 1.1096
Epoch 2/4
8/128 [>.............................] - ETA: 0s - loss: 1.1668
128/128 [==============================] - 0s 141us/sample - loss: 1.1202
Epoch 3/4
8/128 [>.............................] - ETA: 0s - loss: 1.0059
128/128 [==============================] - 0s 141us/sample - loss: 1.1202
Epoch 4/4
8/128 [>.............................] - ETA: 0s - loss: 1.6480
128/128 [==============================] - 0s 156us/sample - loss: 1.1202
Once you have a nan in your model, you will unavoidably have nans in the gradients. Once you have nans in the gradients and they are summed up, you will have nans in all of the model's weights. And once the model's weights are nan, there is nothing more you can do with that model.

Check it yourself with print(model.get_weights()) after training.

The loss goes up because the model suddenly starts outputting only zeros (all the weights are nan, so every raw output is nan, and patch_nans turns it into zero), and from the second pass onward it never changes again.
Why?
Yes, I know it sounds strange, because you replace the nans before computing the loss, but some internal machinery in TensorFlow still sees them. Most likely it is still applying the chain rule, and it does not understand that when there is a zero it should simply skip all the preceding layers. It is a computer, after all, and zero * nan = nan.
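Here is a minimal sketch of that backward-pass behavior (the variables are illustrative): the forward value is patched to zero, but the chain rule still multiplies the zero upstream gradient by the nan local gradient of the multiplication, and the nan survives.

w = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = w * float("nan")                      # a nan appears inside the "model"
    z = tf.where(tf.math.is_nan(y), 0.0, y)   # forward value is patched to 0.0
print(tape.gradient(z, w))  # tf.Tensor(nan, shape=(), dtype=float32)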
Solution?
If you are really keen on using nans (it does not sound like a good idea, though), you have to remove them at the very beginning.

Here is a proposal: you remove the nans right at the start, then reuse the same nan mask to zero the final outputs where the nans were, and you also turn the labels into zeros wherever there are nans. That way your loss behaves well:
import tensorflow.keras.backend as K


# Uses a given nan mask to zero the values at the masked positions.
def removeNan(x):
    t, nan_mask = x
    return tf.where(nan_mask, tf.zeros_like(t), t)


# A changed model that removes the nans at the very beginning.
# Later, the same nan mask is used to zero the outputs.
def get_model2() -> models.Model:
    inp = layers.Input(shape=[FEATURE_COUNT])

    # Remove the nans before anything else! Keep the mask for the outputs.
    nanMask = layers.Lambda(lambda x: tf.math.is_nan(x))(inp)
    mid = layers.Lambda(removeNan)([inp, nanMask])

    mid = layers.Dense(units=64)(mid)
    mid = layers.ReLU()(mid)
    mid = layers.Dense(units=1)(mid)

    # Reduce the per-feature mask to one flag per sample so it matches the
    # single-unit output, then apply it again for consistent results.
    outMask = layers.Lambda(
        lambda m: tf.reduce_any(m, axis=-1, keepdims=True))(nanMask)
    out = layers.Lambda(removeNan)([mid, outMask])
    return models.Model(inp, out)


# Your features and labels
features = tf.random.normal(shape=[TRAINING_SET_SIZE, FEATURE_COUNT])
features_with_nans = tf.maximum(tf.math.log(features + 1), tf.zeros_like(features))
labels = tf.random.normal(shape=[TRAINING_SET_SIZE, 1])

# Remember to zero the labels too, so you get a more trustworthy loss value:
feature_nans = 0 * K.sum(features_with_nans, axis=-1, keepdims=True)  # nan where any feature is nan
labels_with_nans = labels + feature_nans
labels_with_nans = K.switch(tf.math.is_nan(labels_with_nans),
                            K.zeros_like(labels_with_nans),
                            labels_with_nans)

# Build the new model
model = get_model2()
model.compile(
    optimizer=tf.optimizers.SGD(),
    loss=losses.mean_squared_error
)
model.summary()

# Fit and check the weights
model.fit(features_with_nans, labels_with_nans, batch_size=10, epochs=5)
print(model.get_weights())
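As a quick follow-up check (assuming NumPy is available as np), you can assert that no weight ever became nan:

import numpy as np

# Every weight array should be finite after training.
assert all(np.isfinite(w).all() for w in model.get_weights()), "nan/inf in weights"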
Caution (this must be checked): I read somewhere that on a GPU or TPU the nans would be internally replaced with zeros to make it possible to use the hardware. If that is true, you should definitely use something other than nan, for example a -10000 value serving as the marker for the mask in the approach I suggested.
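A rough sketch of that sentinel variant (the SENTINEL name and value are illustrative assumptions): replace the nans with the sentinel before the data reaches the model, and build the mask from the sentinel instead of from is_nan.

SENTINEL = -10000.0

# Replace nans in the features with the sentinel value:
features_sentinel = tf.where(tf.math.is_nan(features_with_nans),
                             SENTINEL * tf.ones_like(features_with_nans),
                             features_with_nans)

# Inside get_model2, the mask would then come from the sentinel:
#   nanMask = layers.Lambda(lambda x: tf.equal(x, SENTINEL))(inp)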