Neural network in TF 2.0: cannot train in float64 precision

Question

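I have a working neural network (built with TensorFlow 2.0 and the Keras API) that I train in float32 precision, which is the default. Now I want to train it in float64 precision. Before running the network, I enable this with tensorflow.keras.backend.set_floatx('float64'). Training starts, but at the last batch of the first epoch I get the following error: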

  File "Z:\Z_MASTER\DL_Reconstruction\train_stage_1.py", line 49, in train_vae
    validation_split=1/19, callbacks=callbacks) # CHANGE val split
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 674, in fit
    steps_name='steps_per_epoch')
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\keras\engine\training_arrays.py", line 449, in model_iteration
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 298, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1614, in on_epoch_end
    self._log_weights(epoch)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1696, in _log_weights
    self._log_weight_as_image(weight, weight_name, epoch)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1721, in _log_weight_as_image
    summary_ops_v2.image(weight_name, w_img, step=epoch)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\ops\summary_ops_v2.py", line 820, in image
    return summary_writer_function(name, tensor, function, family=family)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\ops\summary_ops_v2.py", line 730, in summary_writer_function
    should_record_summaries(), record, _nothing, name="")
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\framework\smart_cond.py", line 54, in smart_cond
    return true_fn()
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\ops\summary_ops_v2.py", line 723, in record
    with ops.control_dependencies([function(tag, scope)]):
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\ops\summary_ops_v2.py", line 818, in function
    name=scope)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\ops\gen_summary_ops.py", line 654, in write_image_summary
    name=name, ctx=_ctx)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\ops\gen_summary_ops.py", line 698, in write_image_summary_eager_fallback
    attrs=_attrs, ctx=_ctx, name=name)
  File "Z:\Z_MASTER\Envs\p37_new_clone\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: Value for attr 'T' of double is not in the list of allowed values: uint8, float, half
    ; NodeDef: {{node WriteImageSummary}}; Op<name=WriteImageSummary; signature=writer:resource, step:int64, tag:string, tensor:T, bad_color:uint8 -> ; attr=max_images:int,default=3,min=1; attr=T:type,default=DT_FLOAT,allowed=[DT_UINT8, DT_FLOAT, DT_HALF]; is_stateful=true> [Op:WriteImageSummary] name: enc_0_conv/kernel_0/

Process finished with exit code 1

Long story short, I think the last line of the error message is the most helpful for tracking down the problem:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Value for attr 'T' of double is not in the list of allowed values: uint8, float, half

I tried to fix this by setting the dtype argument of the layers to float64, as follows (snippets only):

conv = Conv2D(..., dtype='float64')(input)
...
output = ReLU(dtype='float64')(input)
...
lat_var = Lambda(... dtype='float64')([z_mean, z_log_var])
...

The code crashes at this line:

history = model.fit(x=images, y=images, epochs=200, batch_size=32,
                    validation_split=1/19, callbacks=callbacks)

Here images is a NumPy array of dtype float64, obtained with images = images.astype('float64').

Does anyone know how I can train in float64 precision?

Answer 1

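The error was caused by the two TensorBoard callbacks, which are called at the end of every epoch to log the training. More specifically, setting the write_images argument of the TensorBoard callbacks to False solved the problem. This matches the traceback: the failure occurs in _log_weight_as_image, which writes each weight through the WriteImageSummary op, and that op accepts only uint8, float and half tensors, not double.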

Here is the complete working code for the callbacks:

TensorBoard(log_dir='logs_1', profile_batch=0, histogram_freq=1, write_images=False)
TensorBoard(log_dir='current_logs', profile_batch=0, histogram_freq=1, write_images=False)
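
To show how the pieces fit together, here is a minimal sketch of a float64 training run with a TensorBoard callback that has image logging turned off. It is not the original VAE: the tiny dense model, the random data, the temporary log directory and the function name train_float64_demo are all assumptions made for illustration. Only set_floatx('float64') and the TensorBoard arguments come from the question and answer above.

```python
import tempfile

import numpy as np
import tensorflow as tf


def train_float64_demo():
    """Train a tiny float64 model for one epoch with TensorBoard logging."""
    # Make float64 the default dtype for all layers created from here on.
    tf.keras.backend.set_floatx('float64')

    # Hypothetical stand-in data (the question uses an array of images).
    x = np.random.rand(64, 8).astype('float64')

    # Hypothetical toy autoencoder-like model, input and target are the same.
    inputs = tf.keras.Input(shape=(8,))
    hidden = tf.keras.layers.Dense(4, activation='relu')(inputs)
    outputs = tf.keras.layers.Dense(8)(hidden)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')

    # write_images=False skips logging weights as images, which is the
    # step that fails for float64 weights in the traceback above.
    tensorboard = tf.keras.callbacks.TensorBoard(
        log_dir=tempfile.mkdtemp(),  # assumption: throwaway log directory
        profile_batch=0,
        histogram_freq=1,
        write_images=False,
    )

    history = model.fit(x=x, y=x, epochs=1, batch_size=32,
                        validation_split=0.25, callbacks=[tensorboard],
                        verbose=0)
    return model, history
```

Calling train_float64_demo() returns the trained model and the Keras History object. Histogram logging stays enabled through histogram_freq=1, so weight distributions should still be available in TensorBoard even though the image view of the weights is not.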
