Keras training works with binary_crossentropy but not categorical_crossentropy

Question

This question is not a duplicate of the following one:

Moved to Tensorflow 2.0, training now hangs after third step

Here is a breakdown of what I did and what happened:

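When I run the simple CNN below, I get the output shown further down, followed by an error. Note that my labels are already one-hot encoded with tensorflow.keras.utils.to_categorical, so the labels should not be the cause.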

import numpy as np
import tensorflow.keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.layers import Conv2D, Dense, Flatten

# image dimensions
img_rows, img_cols = 28, 28
num_classes = 10

# the data, split between train and test sets
(x, y), (x_val, y_val) = mnist.load_data()

# float32 for the model
x = x.astype('float32')
x_val = x_val.astype('float32')

print('x:', x.shape)
print('y:', y.shape)
print('x_val:', x_val.shape)
print('y_val:', y_val.shape)

# reshape into required dimensions
x = x.reshape(x.shape[0], img_rows, img_cols, 1)
x_val = x_val.reshape(x_val.shape[0], img_rows, img_cols, 1)

# convert class vectors to binary class matrices
y = tensorflow.keras.utils.to_categorical(y, num_classes)
y_val = tensorflow.keras.utils.to_categorical(y_val, num_classes)

print('Convert class vectors to binary class matrices: 1 becomes {}'.format(y[1]))
print('y:', y.shape)
print('y_val:', y_val.shape)


input_shape = x[0].shape

model = Sequential()

model.add(Conv2D(4, 3, 1, padding='same', input_shape=input_shape, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

model.compile(loss=tensorflow.keras.losses.categorical_crossentropy,
              optimizer=Nadam(),
              metrics=['acc'])

history = model.fit(x, y,
                    validation_data=(x_val, y_val),
                    batch_size=128,
                    epochs=100,
                    verbose=1)
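For reference, to_categorical turns each integer label into a row with a single 1 at the label's index. The NumPy sketch below mimics that with np.eye (an equivalent formulation, not Keras's own implementation) and checks the properties categorical_crossentropy relies on. The labels are hypothetical example values.

```python
import numpy as np

num_classes = 10

# Hypothetical integer labels standing in for the MNIST y array.
y_int = np.array([5, 0, 4, 1, 9])

# np.eye(num_classes)[labels] selects one row of the identity matrix per label,
# which gives the same one-hot layout that to_categorical produces.
y_onehot = np.eye(num_classes, dtype="float32")[y_int]

# Sanity checks: one row per label, one column per class, exactly one 1 per row,
# and the 1 sits at the index of the original label.
assert y_onehot.shape == (len(y_int), num_classes)
assert np.all(y_onehot.sum(axis=1) == 1)
assert np.array_equal(y_onehot.argmax(axis=1), y_int)

print(y_onehot[1])  # label 0 -> [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```

This matches the shape and contents printed in the output below, so the targets themselves look well-formed.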

Output:

Note that the CNN really does stop training at 36096/60000. In other words, the output is not just cut off at 36096 because of how I pasted it.

x: (60000, 28, 28)
y: (60000,)
x_val: (10000, 28, 28)
y_val: (10000,)
Convert class vectors to binary class matrices: 1 becomes [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
y: (60000, 10)
y_val: (10000, 10)
Train on 60000 samples, validate on 10000 samples
Epoch 1/10
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
60000/60000 [==============================] - 3s 55us/sample - loss: 4.9133 - acc: 0.8514 - val_loss: 1.5114 - val_acc: 0.8999
Epoch 2/10
36096/60000 [=================>............] - ETA: 0s - loss: 0.9090 - acc: 0.9264

Error:

2019-12-10 13:13:48.694128: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.

On the other hand

When I change the loss from categorical_crossentropy to binary_crossentropy, everything works fine.

The change:

model.compile(loss=tensorflow.keras.losses.binary_crossentropy,
              optimizer=Nadam(),
              metrics=['acc'])
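The two losses are computed differently, which is why the loss values in the two logs are not directly comparable (4.9133 versus 0.2263 after the first epoch). Below is a minimal NumPy sketch of the textbook formulas, assuming one softmax output vector and a one-hot target. Keras also clips probabilities by a small epsilon, but treat the exact clipping used here as an assumption. This only illustrates the difference in scale; it does not explain the hang.

```python
import numpy as np

eps = 1e-7  # clip probabilities away from 0 and 1 before taking logs

def categorical_ce(y_true, y_pred):
    # -sum_k y_k * log(p_k): only the probability assigned to the true class matters.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.sum(y_true * np.log(p), axis=-1)

def binary_ce(y_true, y_pred):
    # Treats each of the K outputs as an independent yes/no problem,
    # then averages the K per-output losses.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p), axis=-1)

y_true = np.eye(10)[[3]]         # one sample whose true class is 3
y_pred = np.full((1, 10), 0.05)  # hypothetical softmax output: 0.05 everywhere...
y_pred[0, 3] = 0.55              # ...and 0.55 on the true class (row sums to 1)

print(categorical_ce(y_true, y_pred))  # roughly 0.598
print(binary_ce(y_true, y_pred))       # roughly 0.106
```

Because binary_crossentropy divides by the number of outputs (10 here), the same prediction yields a much smaller loss value.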

New output:

Note that the CNN now runs smoothly; it no longer freezes at a particular sample, and the ptxas error from before does not appear either.

x: (60000, 28, 28)
y: (60000,)
x_val: (10000, 28, 28)
y_val: (10000,)
Convert class vectors to binary class matrices: 1 becomes [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
y: (60000, 10)
y_val: (10000, 10)
Train on 60000 samples, validate on 10000 samples
Epoch 1/10
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
60000/60000 [==============================] - 3s 58us/sample - loss: 0.2263 - acc: 0.9745 - val_loss: 0.0710 - val_acc: 0.9891
Epoch 2/10
60000/60000 [==============================] - 2s 29us/sample - loss: 0.0460 - acc: 0.9917 - val_loss: 0.0434 - val_acc: 0.9914
Epoch 3/10
60000/60000 [==============================] - 2s 29us/sample - loss: 0.0240 - acc: 0.9943 - val_loss: 0.0370 - val_acc: 0.9918
Epoch 4/10
60000/60000 [==============================] - 2s 29us/sample - loss: 0.0158 - acc: 0.9958 - val_loss: 0.0283 - val_acc: 0.9932
Epoch 5/10
60000/60000 [==============================] - 2s 29us/sample - loss: 0.0120 - acc: 0.9964 - val_loss: 0.0301 - val_acc: 0.9926
Epoch 6/10
60000/60000 [==============================] - 2s 30us/sample - loss: 0.0094 - acc: 0.9971 - val_loss: 0.0301 - val_acc: 0.9931
Epoch 7/10
60000/60000 [==============================] - 2s 29us/sample - loss: 0.0084 - acc: 0.9974 - val_loss: 0.0310 - val_acc: 0.9932
Epoch 8/10
60000/60000 [==============================] - 2s 29us/sample - loss: 0.0078 - acc: 0.9976 - val_loss: 0.0303 - val_acc: 0.9933
Epoch 9/10
60000/60000 [==============================] - 2s 30us/sample - loss: 0.0074 - acc: 0.9977 - val_loss: 0.0312 - val_acc: 0.9928
Epoch 10/10
60000/60000 [==============================] - 2s 30us/sample - loss: 0.0069 - acc: 0.9979 - val_loss: 0.0308 - val_acc: 0.9931
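Something similar may apply to the acc metric. As far as I know, tf.keras resolves 'acc' to binary accuracy when the loss is binary_crossentropy and to categorical accuracy when it is categorical_crossentropy. If so, the accuracy figures in the two runs do not measure the same thing. The sketch below uses hypothetical predictions to show how far the two definitions can diverge with 10 classes.

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    # A sample counts as correct when the arg-max class matches the true class.
    return np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1))

def binary_accuracy(y_true, y_pred, threshold=0.5):
    # Thresholds every output separately and averages over all K outputs,
    # so each sample contributes K yes/no decisions instead of one.
    return np.mean(y_true == (y_pred > threshold).astype(y_true.dtype))

y_true = np.eye(10)[[3, 7]]     # two samples, true classes 3 and 7
y_pred = np.full((2, 10), 0.1)  # hypothetical uniform predictions, nothing above 0.5

print(categorical_accuracy(y_true, y_pred))  # 0.0
print(binary_accuracy(y_true, y_pred))       # 0.9
```

With 10 one-hot outputs, a model that never outputs anything above 0.5 still scores 0.9 under the per-output definition, because 9 of the 10 targets in every row are 0.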

Questions:

Am I doing something wrong?
Is there a problem with categorical_crossentropy?

I have rerun it several times and get the same result every time.

System information:

tensorflow-gpu 2.0.0
keras-gpu 2.2.4
Cuda compilation tools, release 10.0, V10.0.130
cuDNN 7.4.02
Windows 10
python 3.6.8

Update:

I tried the approach suggested by saurjog, but the problem persists. CNN training still freezes with categorical_crossentropy but works fine with binary_crossentropy.

Versions I tried:

TF 1.12.0/CUDA 9.0/cuDNN 7.3.1.20
TF 1.14/CUDA 10.0/cuDNN 7.4.0.20

Answer 1

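This GitHub issue might help, although it is still open. It looks like the problem is not specific to the loss function in your question. According to the discussion thread on that GitHub issue, these Keras/cuDNN/TF version combinations seem to work: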

TensorFlow 1.12.0/CUDA 9.0/cuDNN 7.3.1.20
TensorFlow 1.14/CUDA 10.0
