How to make a Dataset pipeline support distributed reading and consuming

Question

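It is easy to use two threads, where one keeps feeding data into a queue and the other consumes data from the queue and performs the computation. Since TensorFlow has recommended Dataset as the input pipeline since 1.2.0, I would like to use Dataset and its iterator to accomplish the task above, namely:

there are two processes, one feeds and the other consumes;
the pipeline suspends when it is full or empty, and it stops when the computation has finished consuming.

P.S. Why does the Threading and Queues tutorial in TensorFlow use threads instead of processes?

Thanks in advance.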

Answer 1

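TensorFlow 1.3 does not yet support distributed tf.contrib.data pipelines. We are working on support for splitting datasets across devices and/or processes, but that support is not yet ready.

In the meantime, the easiest way to achieve your goal is to use a tf.FIFOQueue. You can define a Dataset that reads from a queue as follows: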

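import tensorflow as tf  # TensorFlow 1.3, where Dataset lives in tf.contrib.data

# Queue that the feeding side enqueues into (arguments elided in the original answer).
q = tf.FIFOQueue(...)

# Define a dummy dataset that contains the same value repeated indefinitely.
dummy = tf.contrib.data.Dataset.from_tensors(0).repeat(None)

# Each dummy element triggers one dequeue, so the dataset yields the queue's items.
dataset_from_queue = dummy.map(lambda _: q.dequeue())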

You can then compose other Dataset transformations with dataset_from_queue.
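The suspend-and-stop behaviour asked for in the question can be illustrated without TensorFlow. The following is only a minimal sketch in plain Python, using the standard library `queue.Queue` and `threading` as a stand-in for a `tf.FIFOQueue` with finite capacity (whose enqueue blocks when the queue is full and whose dequeue blocks when it is empty). The names `run_pipeline`, `producer`, and `stop` are hypothetical, and squaring the items is just a placeholder for the real computation.

```python
import itertools
import queue
import threading


def run_pipeline(num_to_consume, capacity=4):
    """Consume `num_to_consume` items from an endless producer, then stop both sides."""
    # Bounded queue: put() blocks while full, get() blocks while empty.
    q = queue.Queue(maxsize=capacity)
    # Set by the consumer once the computation is finished.
    stop = threading.Event()

    def producer():
        for item in itertools.count():  # stand-in for reading input records
            while not stop.is_set():
                try:
                    q.put(item, timeout=0.05)  # suspends while the queue is full
                    break
                except queue.Full:
                    continue  # re-check the stop flag, then retry the same item
            if stop.is_set():
                return

    feeder = threading.Thread(target=producer, daemon=True)
    feeder.start()

    results = []
    for _ in range(num_to_consume):
        value = q.get()  # suspends while the queue is empty
        results.append(value * value)  # placeholder for the computation
    stop.set()  # the consumer is done, so tell the producer to stop
    feeder.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(5))
```

In the TensorFlow setup from the answer, the feeding side would run the queue's enqueue op and the consuming side would pull from an iterator over dataset_from_queue; the sketch only mirrors the blocking semantics, not the TensorFlow API.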

python

machine-learning

distributed-computing

tensorflow