CloudML jobs don't terminate when TensorFlow 1.9 is used
Our CloudML training jobs don't terminate after training completes when TF 1.9 (which is officially supported) is used. The jobs just sit there indefinitely. Interestingly, CloudML jobs on TF 1.8 run without any issues. Our model is created via tf.Estimator.
A typical log (when using TF <=1.8) looks like this:
I Job completed successfully.
I Finished tearing down training program.
I ps-replica-0 Clean up finished. ps-replica-0
I ps-replica-0 Module completed; cleaning up. ps-replica-0
I ps-replica-0 Signal 15 (SIGTERM) was caught. Terminated by service.
This is normal behavior. ps-replica-0
I Tearing down training program.
I master-replica-0 Task completed successfully. master-replica-0
I master-replica-0 Clean up finished. master-replica-0
I master-replica-0 Module completed; cleaning up. master-replica-0
I master-replica-0 Loss for final step: 0.054428928. master-replica-0
I master-replica-0 SavedModel written to: XXX master-replica-0
With TF 1.9, we see the following instead:
I master-replica-0 Skip the current checkpoint eval due to throttle secs (30 secs). master-replica-0
I master-replica-0 Saving checkpoints for 20034 into gs://bg-dataflow/yuri/nine_gag_recommender_train_test/trained_model/model.ckpt. master-replica-0
I master-replica-0 global_step/sec: 17.7668 master-replica-0
I master-replica-0 SavedModel written to: XXX master-replica-0
Any ideas?
Checking the logs for the job ID you sent, it looks like only half of the workers completed their task while the other half got stuck, so the master was waiting for them to be alive, which caused your job to hang.
By default, when using tf.Estimator, the master waits for all workers to be alive. In large-scale distributed training with many workers, it is important to set device_filters so that the master only depends on the parameter servers (PS) being alive, and likewise each worker should only depend on the PS being alive.
The solution is to set your device filters in tf.ConfigProto() and pass it to the session_config argument of tf.estimator.RunConfig().
You can find more details here: https://cloud.google.com/ml-engine/docs/tensorflow/distributed-training-details#set-device-filters
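A minimal sketch of what that could look like with the TF 1.x Estimator API. It assumes the standard TF_CONFIG environment variable that CloudML sets on each replica; my_model_fn and the model_dir path are placeholders, not taken from the original job:

import json
import os

import tensorflow as tf

# Read the task type and index assigned by CloudML from TF_CONFIG.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task = tf_config.get('task', {})
task_type = task.get('type', 'master')
task_index = task.get('index', 0)

if task_type == 'master':
    # The master only needs the parameter servers (and itself) to be alive.
    device_filters = ['/job:ps', '/job:master']
elif task_type == 'worker':
    # Each worker only needs the parameter servers and its own task.
    device_filters = ['/job:ps', '/job:worker/task:%d' % task_index]
else:
    # Parameter servers (and any other roles) keep the default behavior.
    device_filters = None

session_config = tf.ConfigProto(device_filters=device_filters)
run_config = tf.estimator.RunConfig(session_config=session_config)

estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,                      # your existing model_fn
    model_dir='gs://your-bucket/model_dir',    # placeholder path
    config=run_config)

With these filters in place, a hung or slowly-terminating worker no longer blocks the master from shutting the job down once training finishes.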