tensorflow distribute seq2seq stuck forever
I am trying to launch a distributed seq2seq model in TensorFlow. Here is the original single-process seq2seq model.
I followed the TensorFlow distributed tutorial here and set up a cluster (1 ps, 3 workers), but all the workers get stuck forever, printing the same PoolAllocator log messages over and over:
start running session
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:244] PoolAllocator: After 7623 get requests, put_count=3649 evicted_count=1000 eviction_rate=0.274048 and unsatisfied allocation rate=0.665617
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:256] Raising pool_size_limit_ from 100 to 110
Here is the cluster setup in translate.py:
ps_hosts = ["9.91.9.129:2222"]
worker_hosts = ["9.91.9.130:2223", "9.91.9.130:2224", "9.91.9.130:2225"]
#worker_hosts = ["9.91.9.130:2223"]
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
if FLAGS.job_name == "ps":
  server.join()
elif FLAGS.job_name == "worker":
  # Worker server
  is_chief = (FLAGS.task_index == 0)
  gpu_num = FLAGS.task_index
  with tf.Graph().as_default():
    with tf.device(tf.train.replica_device_setter(cluster=cluster,
        worker_device="/job:worker/task:%d/gpu:%d" % (FLAGS.task_index, gpu_num))):
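For completeness, the session on each worker is then created with a tf.train.Supervisor against server.target, roughly as in the sketch below (the Supervisor arguments and names such as FLAGS.train_dir, model.global_step and model.saver are illustrative, not copied verbatim from my code):

sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir=FLAGS.train_dir,  # assumed checkpoint directory flag
                         init_op=tf.initialize_all_variables(),
                         global_step=model.global_step,
                         saver=model.saver)
print("start running session")
# The chief initializes the variables; the other workers wait for it.
sess = sv.prepare_or_wait_for_session(server.target)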
I use tf.train.SyncReplicasOptimizer for synchronous training. Here is the relevant part of my seq2seq_model.py:
# Gradients and SGD update operation for training the model.
params = tf.trainable_variables()
if not forward_only:
  self.gradient_norms = []
  self.updates = []
  opt = tf.train.GradientDescentOptimizer(self.learning_rate)
  opt = tf.train.SyncReplicasOptimizer(
      opt,
      replicas_to_aggregate=num_workers,
      replica_id=task_index,
      total_num_replicas=num_workers)
  for b in xrange(len(buckets)):
    gradients = tf.gradients(self.losses[b], params)
    clipped_gradients, norm = tf.clip_by_global_norm(gradients,
                                                     max_gradient_norm)
    self.gradient_norms.append(norm)
    self.updates.append(opt.apply_gradients(
        zip(clipped_gradients, params), global_step=self.global_step))
  self.init_tokens_op = opt.get_init_tokens_op
  self.chief_queue_runners = [opt.get_chief_queue_runner]
self.saver = tf.train.Saver(tf.all_variables())
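These ops are later consumed in the model's step() function, roughly as in the usual tutorial-style sketch below (bucket_id, input_feed and session stand for the normal seq2seq feed construction, not my exact code):

# One training step on a given bucket (sketch).
output_feed = [self.updates[bucket_id],         # parameter update through SyncReplicasOptimizer
               self.gradient_norms[bucket_id],  # gradient norm, for logging
               self.losses[bucket_id]]          # loss for this bucket
_, norm, loss = session.run(output_feed, input_feed)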
Here is my full Python code [here].
It seems the TensorFlow folks are not yet ready to properly share their experience of running code on a cluster; so far, comprehensive documentation can only be found in the source code.
According to SyncReplicasOptimizer.py as of version 0.11, after constructing the SyncReplicasOptimizer you must run:
init_token_op = optimizer.get_init_tokens_op()
chief_queue_runner = optimizer.get_chief_queue_runner()
Then, after your session has been built with the Supervisor, run:
if is_chief:
    sess.run(init_token_op)
    sv.start_queue_runners(sess, [chief_queue_runner])
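Putting the two fragments together, the chief worker's startup looks roughly like the sketch below; opt, sv, is_chief, global_step and server are assumed names for your SyncReplicasOptimizer, Supervisor, chief flag, step counter and tf.train.Server:

# After building the graph (and thus the SyncReplicasOptimizer "opt"):
init_token_op = opt.get_init_tokens_op()
chief_queue_runner = opt.get_chief_queue_runner()

sv = tf.train.Supervisor(is_chief=is_chief,
                         init_op=tf.initialize_all_variables(),
                         global_step=global_step)
sess = sv.prepare_or_wait_for_session(server.target)

if is_chief:
    # Without these two calls the sync token queue stays empty and every
    # replica blocks on its first update, which matches the "stuck forever"
    # symptom described above.
    sess.run(init_token_op)
    sv.start_queue_runners(sess, [chief_queue_runner])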
With SyncReplicasOptimizerV2 introduced in 0.12, this code may not be enough; please refer to the source code of the version you are using.