CloudML jobs don't terminate when TensorFlow 1.9 is used

When using TF 1.9 (which is officially supported), our CloudML training jobs don't terminate after training completes. The jobs just sit there indefinitely. Interestingly, CloudML jobs on TF 1.8 run without any problems. Our models are created via tf.Estimator.

A typical log (when using TF <= 1.8) is:

I  Job completed successfully.
I  Finished tearing down training program. 
I  ps-replica-0 Clean up finished.  ps-replica-0
I  ps-replica-0 Module completed; cleaning up.  ps-replica-0
I  ps-replica-0 Signal 15 (SIGTERM) was caught. Terminated by service. 
This is normal behavior.  ps-replica-0
I  Tearing down training program. 
I  master-replica-0 Task completed successfully.  master-replica-0
I  master-replica-0 Clean up finished.  master-replica-0
I  master-replica-0 Module completed; cleaning up.  master-replica-0
I  master-replica-0 Loss for final step: 0.054428928.  master-replica-0
I  master-replica-0 SavedModel written to: XXX  master-replica-0

With TF 1.9, we see the following instead:

I  master-replica-0 Skip the current checkpoint eval due to throttle secs (30 secs). master-replica-0 
I  master-replica-0 Saving checkpoints for 20034 into gs://bg-dataflow/yuri/nine_gag_recommender_train_test/trained_model/model.ckpt. master-replica-0 
I  master-replica-0 global_step/sec: 17.7668 master-replica-0 
I  master-replica-0 SavedModel written to: XXX master-replica-0 

Any ideas?

Checking the logs for the job ID you sent, it looks like only half of the workers completed their tasks while the other half got stuck. The master was waiting for them to be alive, which caused your job to hang.

By default, when using tf.Estimator, the master waits for all workers to be alive. In large-scale distributed training with many workers, it is important to set device_filters so that the master only depends on the parameter servers (PS) being alive; likewise, each worker should only depend on the PS being alive.

The solution is to set your device filters in tf.ConfigProto() and pass it to the session_config argument of tf.estimator.RunConfig(). You can find more details here: https://cloud.google.com/ml-engine/docs/tensorflow/distributed-training-details#set-device-filters
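A minimal sketch of what that looks like in practice. The helper below (a hypothetical function, not part of any library) picks the device filters per replica role; the commented-out lines show how it would be wired into tf.ConfigProto and tf.estimator.RunConfig on TF 1.x, with `my_model_fn` as a placeholder:

```python
import json
import os


def device_filters_for(task_type, task_index):
    """Return session device filters so that each replica only waits
    on the tasks it actually depends on (hypothetical helper).

    - The master/chief depends only on the parameter servers and itself.
    - A worker depends only on the parameter servers and itself.
    - A PS gets no filter here (an empty list means no filtering).
    """
    if task_type in ("master", "chief"):
        return ["/job:ps", "/job:%s" % task_type]
    if task_type == "worker":
        return ["/job:ps", "/job:worker/task:%d" % task_index]
    return []


# On CloudML the replica role comes from the TF_CONFIG environment
# variable, e.g.:
# tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
# task = tf_config.get("task", {})
# filters = device_filters_for(task.get("type", "master"), task.get("index", 0))
#
# session_config = tf.ConfigProto(device_filters=filters)
# run_config = tf.estimator.RunConfig(session_config=session_config)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
```

With these filters in place, the master no longer blocks on stuck workers at shutdown; it only requires the PS tasks to stay reachable.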