Solving SVHN using Tensorflow Error: "Resource exhausted: OOM when allocating tensor.."

I am trying to solve the "SVHN" dataset classification problem using the convolutional neural network provided here: https://www.tensorflow.org/versions/0.6.0/tutorials/deep_cnn/index.html#convolutional-neural-networks

This is how I read and format the data:

import scipy.io
import tensorflow as tf

# Load the entire training set into memory, then convert to tensors.
read_input = scipy.io.loadmat('data/train_32x32.mat')
converted_label = tf.cast(read_input['y'], tf.int32)
converted_image = tf.cast(read_input['X'], tf.float32)
# loadmat returns images as (32, 32, 3, N); move the batch dimension first.
reshaped_image = tf.transpose(converted_image, [3, 0, 1, 2])

In the _generate_image_and_label_batch function, I modified the code slightly, since the input images in train_32x32.mat and test_32x32.mat are already in 4-D format.

images, label_batch = tf.train.shuffle_batch(
      [image, label],
      batch_size=FLAGS.batch_size,
      enqueue_many=True,
      num_threads=num_preprocess_threads,
      capacity=min_queue_examples + 3 * FLAGS.batch_size,
      min_after_dequeue=min_queue_examples)

I am getting these errors:

Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 4
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 4
W tensorflow/core/kernels/cast_op.cc:66] Resource exhausted: OOM when allocating tensor with shape dim { size: 32 } dim { size: 32 } dim { size: 3 } dim { size: 73257 }
W tensorflow/core/common_runtime/executor.cc:1027] 0x7f1c180015a0 Compute status: Resource exhausted: OOM when allocating tensor with shape dim { size: 32 } dim { size: 32 } dim { size: 3 } dim { size: 73257 }
     [[Node: Cast_1 = Cast[DstT=DT_FLOAT, SrcT=DT_UINT8, _device="/job:localhost/replica:0/task:0/cpu:0"](Cast_1/x)]]
W tensorflow/core/kernels/cast_op.cc:66] Resource exhausted: OOM when allocating tensor with shape dim { size: 32 } dim { size: 32 } dim { size: 3 } dim { size: 73257 }
W tensorflow/core/common_runtime/executor.cc:1027] 0x7f1c280ea810 Compute status: Resource exhausted: OOM when allocating tensor with shape dim { size: 32 } dim { size: 32 } dim { size: 3 } dim { size: 73257 }
     [[Node: Cast_1 = Cast[DstT=DT_FLOAT, SrcT=DT_UINT8, _device="/job:localhost/replica:0/task:0/cpu:0"](Cast_1/x)]]
Killed

Please let me know if I have made a mistake anywhere in my logic.

Thanks

Sarah

Note that your data has 32*32*3*73257 entries, which is 900 MB as floats and 1800 MB as doubles. So you allocate 1800 MB at read_input['X'], then TF converts that into a tensor to feed into cast, which is another 900 MB. The output of tf.cast is another 900 MB tensor, and the output of transpose is yet another 900 MB tensor.

So you may need 4.5 GB of memory to run this.
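That accounting can be checked directly (using decimal MB, i.e. 10^6 bytes):

```python
# Memory footprint of holding the full SVHN training set in memory at once,
# following the accounting above (1 MB = 10**6 bytes).
entries = 32 * 32 * 3 * 73257           # image entries in train_32x32.mat

float32_mb = entries * 4 / 10**6        # one float32 copy
float64_mb = entries * 8 / 10**6        # one float64 (double) copy

# loadmat result (double) + tensor fed into cast + cast output + transpose output
total_gb = (float64_mb + 3 * float32_mb) / 1000
print(round(float32_mb), round(float64_mb), round(total_gb, 1))
# prints: 900 1800 4.5
```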

In general, this approach (converting to a Constant node) is only recommended for "small" problems. There is a hard 2 GB limit on what you can put into a constant, and even smaller sizes (i.e. >100 MB) can cause problems if you move to a GPU (example here).

A more scalable alternative is to use an input pipeline, as in the Cifar example.
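The core idea behind the pipeline is to never materialize the whole dataset as a graph constant, only one mini-batch at a time. A rough illustration in plain NumPy (all names here are hypothetical; the actual Cifar example uses TF queue runners and readers instead):

```python
import numpy as np

def minibatches(images, labels, batch_size, rng):
    """Yield shuffled (image, label) mini-batches. Only one batch is
    cast to float32 at a time, so peak memory stays small."""
    n = images.shape[0]
    order = rng.permutation(n)
    for start in range(0, n - batch_size + 1, batch_size):
        idx = order[start:start + batch_size]
        # Cast per batch instead of casting the full dataset up front.
        yield images[idx].astype(np.float32), labels[idx]

# Stand-ins for read_input['X'] transposed to (N, 32, 32, 3) and read_input['y']:
rng = np.random.default_rng(0)
X = rng.integers(0, 255, size=(1000, 32, 32, 3), dtype=np.uint8)
y = rng.integers(1, 11, size=(1000, 1)).astype(np.int32)

batches = list(minibatches(X, y, 128, rng))
print(len(batches), batches[0][0].shape, batches[0][0].dtype)
# prints: 7 (128, 32, 32, 3) float32
```

Each step then feeds one such batch to the graph, so memory usage is bounded by the batch size rather than the dataset size.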