Most scalable way to use generators with tf.data? The tf.data guide says `from_generator` has limited scalability


tf.data has a `from_generator` initializer, but it does not appear to be scalable. From the official guide:

Caution: While this is a convenient approach it has limited portability and scalability. It must run in the same Python process that created the generator, and is still subject to the Python GIL.

https://www.tensorflow.org/guide/data#consuming_python_generators
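For reference, this is the pattern the guide is cautioning about. A minimal sketch (the generator `gen` and its output signature are made up for illustration) — the generator stays in the calling Python process, subject to the GIL:

```python
import tensorflow as tf

# A plain Python generator; from_generator keeps it running in this
# process, which is the scalability limitation the guide warns about.
def gen():
    for i in range(3):
        yield i, i * i

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int64),
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)

for x, y in ds:
    print(int(x), int(y))
```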

And in the official docs:

NOTE: The current implementation of Dataset.from_generator() uses tf.numpy_function and inherits the same constraints. In particular, it requires the Dataset- and Iterator-related operations to be placed on a device in the same process as the Python program that called Dataset.from_generator(). The body of generator will not be serialized in a GraphDef, and you should not use this method if you need to serialize your model and restore it in a different environment.

NOTE: If generator depends on mutable global variables or other external state, be aware that the runtime may invoke generator multiple times (in order to support repeating the Dataset) and at any time between the call to Dataset.from_generator() and the production of the first element from the generator. Mutating global variables or external state can cause undefined behavior, and we recommend that you explicitly cache any external state in generator before calling Dataset.from_generator().

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator

However, generators are a fairly common way to train on large amounts of data, so there must be some alternative best practice — but the official TensorFlow data guide offers no information on what it is.

Iterate over the generator and write the data out to TFRecord files, then use TFRecordDataset. Here's the guide:

https://www.tensorflow.org/tutorials/load_data/tfrecord
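A minimal sketch of that round trip, assuming a toy generator yielding an integer label and a float value (the generator, feature names, and the `data.tfrecord` path are all hypothetical):

```python
import tensorflow as tf

# Hypothetical generator standing in for your real data source.
def gen():
    for i in range(5):
        yield i, float(i) * 0.5

path = "data.tfrecord"  # hypothetical output path

# One-time pass: drain the generator into a TFRecord file.
with tf.io.TFRecordWriter(path) as writer:
    for label, value in gen():
        example = tf.train.Example(features=tf.train.Features(feature={
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            "value": tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
        }))
        writer.write(example.SerializeToString())

# Training-time: read back with TFRecordDataset, no Python generator involved.
feature_spec = {
    "label": tf.io.FixedLenFeature([], tf.int64),
    "value": tf.io.FixedLenFeature([], tf.float32),
}
ds = tf.data.TFRecordDataset(path).map(
    lambda rec: tf.io.parse_single_example(rec, feature_spec)
)
```

The write pass pays the GIL cost once, up front; the reading side runs entirely in TensorFlow's C++ runtime.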

TF is designed to consume these kinds of datasets efficiently across multiple GPUs.

Sharding the data on disk also improves shuffling.
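A sketch of why sharding helps shuffling: with several TFRecord files you can shuffle at the file level and interleave reads, then add a per-element shuffle buffer on top. The shard count, filenames, and round-robin writer below are illustrative assumptions, not a prescribed layout:

```python
import tensorflow as tf

num_shards = 4  # assumed shard count for illustration
paths = [f"train-{i:02d}-of-{num_shards:02d}.tfrecord" for i in range(num_shards)]

# Write records round-robin across shards (a simple illustrative scheme).
writers = [tf.io.TFRecordWriter(p) for p in paths]
for i in range(20):
    ex = tf.train.Example(features=tf.train.Features(feature={
        "x": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
    }))
    writers[i % num_shards].write(ex.SerializeToString())
for w in writers:
    w.close()

# Shuffle the shard order, interleave reads across shards in parallel,
# then shuffle individual elements with a buffer for finer mixing.
ds = (
    tf.data.Dataset.from_tensor_slices(paths)
    .shuffle(num_shards)
    .interleave(tf.data.TFRecordDataset,
                cycle_length=num_shards,
                num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=20)
)
```

File-level shuffling alone only permutes whole shards; the interleave plus element buffer is what gives a usably random stream without loading everything into memory.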