Google ml-engine: takes forever to fill the queue

Question

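I have created TFRecord files, which are stored in a Google Cloud Storage bucket. I have code running on ml-engine that trains a model using the data in these TFRecord files.

Each TFRecord file contains a batch of 20 examples and is about 8 MB (megabytes) in size. There are several thousand files in the bucket.

My problem is that it takes forever for training to start. I have to wait about 40 minutes between the moment the package is loaded and the moment training actually starts. I am guessing this is the time needed to download the data and fill the queues?

The code is (slightly simplified for the sake of conciseness):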

    # Create a queue which will produce tf record names
    filename_queue = tf.train.string_input_producer(files, num_epochs=num_epochs, capacity=100)

    # Read the record
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    # Map for decoding the serialized example
    features = tf.parse_single_example(
        serialized_example,
        features={
            'data': tf.FixedLenFeature([], tf.float32),
            'label': tf.FixedLenFeature([], tf.int64)
        })

    train_tensors = tf.train.shuffle_batch(
        [features['data'], features['label']],
        batch_size=30,
        capacity=600,
        min_after_dequeue=400,
        allow_smaller_final_batch=True,
        enqueue_many=True)

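I have checked that my bucket and my job share the same region parameter.

I don't understand what is taking so long: it should just be a matter of downloading a few hundred MB (a few dozen TFRecord files should be enough to put more than min_after_dequeue elements in the queue).

Any idea what I am missing, or where the problem might be?

Thanks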

Answer 1

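Sorry, my bad. I was using a custom function to:

- verify that each file passed as a TFRecord actually exists
- expand wildcards, if any

It turns out this is a very bad idea when dealing with thousands of files on gs://

I have removed this "sanity" check and it works fine now.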
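
For readers who still need wildcard expansion, here is a minimal sketch of one way to do it without a per-file existence check. It is not the original custom function, which was not posted. The name `expand_file_patterns` and the `glob_fn` parameter are hypothetical; the default `glob.glob` only handles local paths, so for gs:// paths the assumption is that you would pass a GCS-aware globber such as `tf.io.gfile.glob` (`tf.gfile.Glob` in older TensorFlow 1.x releases).

```python
import glob


def expand_file_patterns(patterns, glob_fn=glob.glob):
    """Expand wildcard patterns with one listing call per pattern.

    Paths without wildcard characters are passed through unchecked, so no
    per-file existence request is issued. glob_fn is a hypothetical hook:
    pass a GCS-aware glob function when the patterns point at gs://.
    """
    files = []
    for pattern in patterns:
        if any(ch in pattern for ch in "*?["):
            # A single listing call returns every match for this pattern.
            files.extend(sorted(glob_fn(pattern)))
        else:
            # Explicit paths are trusted as-is; a missing file should then
            # show up as an error when the reader tries to open it.
            files.append(pattern)
    return files
```

The resulting list can be passed as `files` to `tf.train.string_input_producer`. The point of the design is that the number of storage requests grows with the number of patterns rather than with the number of files.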


google-cloud-ml