OOM when allocating tensor with shape - how to get more GPU memory

Question

[Running in a JupyterLab environment] When training my CNN in TensorFlow:

history = model.fit(
    train_generator,
    steps_per_epoch=3,
    epochs=5,
    verbose=1,
)

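When I run this, I get an 'OOM when allocating tensor with shape' error.

As I understand it, this means I don't have enough GPU memory. How can I connect to a server from Jupyter to get access to more memory for training my NN?

I am using the following packages and code to load the images: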

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Conduct pre-processing on the data to read and feed the images from the directories into the CNN

# Re-scale data as pixels have value of 0-255
train_datagen = ImageDataGenerator(rescale=1/255)
validation_datagen = ImageDataGenerator(rescale=1/255)

# Feed training dataset images in via batches of 425
train_generator = train_datagen.flow_from_directory(
    r'Users\cats-or-dogs\PetImages',  # Directory with training set images (raw string keeps the backslashes literal)
    target_size=(300, 300),  # Re-size target images
    batch_size=425,  # mini-batch of 425 to make CNN more efficient
    class_mode = 'binary'
)

Answer 1

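Let me know if this works. Typically, we can enable mixed precision right after importing the necessary packages, as shown below. It allows faster computation and consumes less GPU memory, so we can also increase the batch size. The hardware has to support it, though, so check that first. The Keras mixed-precision (mp) API is available in TensorFlow 2.x. Joking aside, if you want more GPU memory, add more GPUs; you would then be doing multi-GPU training. With a single GPU, mp is one of the tricks. Otherwise, reducing the batch size may resolve the OOM issue.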

import tensorflow as tf

# TensorFlow 2.4+; on 2.1-2.3 use tf.keras.mixed_precision.experimental.Policy / set_policy instead
tf.keras.mixed_precision.set_global_policy('mixed_float16')
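Since the hardware has to support mixed precision to benefit from it, a quick check may be useful before enabling the policy. The TensorFlow mixed precision guide states that NVIDIA GPUs with compute capability 7.0 or higher see the greatest benefit. The sketch below is not part of the original answer: the function names `benefits_from_mixed_float16` and `gpu_compute_capabilities` are hypothetical, and it assumes `tf.config.experimental.get_device_details` reports a `compute_capability` entry, which may not be the case on every platform.

```python
def benefits_from_mixed_float16(compute_capability):
    """Return True if an NVIDIA GPU with this (major, minor) compute
    capability is 7.0 or newer, the range the TensorFlow guide names
    as benefiting most from mixed precision."""
    return tuple(compute_capability) >= (7, 0)


def gpu_compute_capabilities():
    """Return {gpu_name: (major, minor) or None} for the visible GPUs."""
    # Imported lazily so the helper above can be used without TensorFlow.
    import tensorflow as tf

    result = {}
    for gpu in tf.config.list_physical_devices('GPU'):
        details = tf.config.experimental.get_device_details(gpu)
        # The key may be missing, in which case None is stored.
        result[gpu.name] = details.get('compute_capability')
    return result


# Usage (requires TensorFlow and a visible GPU):
# for name, cc in gpu_compute_capabilities().items():
#     print(name, cc, cc is not None and benefits_from_mixed_float16(cc))
```

Older GPUs can still run with the `mixed_float16` policy and may save memory, but they are unlikely to get the speed-up.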

Quoted from the official documentation, under the performance tips for using mixed precision on GPUs:

Increasing your batch size

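If it doesn't affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing batch size typically increases training throughput, i.e. the training elements per second your model can run on.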
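To make the "half the memory" point concrete, here is a rough back-of-the-envelope sketch for the input batch in the question. It assumes RGB images (3 channels, the default color mode of `flow_from_directory`) and counts only the input tensor; the activations of the network, which usually dominate GPU memory, scale with batch size in the same way. The helper name `batch_bytes` is hypothetical.

```python
def batch_bytes(batch_size, height, width, channels, bytes_per_element):
    """Bytes needed to hold one batch of images as a dense tensor."""
    return batch_size * height * width * channels * bytes_per_element


FLOAT32_BYTES = 4
FLOAT16_BYTES = 2

# Input batch from the question: 425 RGB images of 300x300 pixels.
fp32 = batch_bytes(425, 300, 300, 3, FLOAT32_BYTES)
fp16 = batch_bytes(425, 300, 300, 3, FLOAT16_BYTES)

print(f"float32 batch: {fp32 / 2**20:.1f} MiB")
print(f"float16 batch: {fp16 / 2**20:.1f} MiB")

# A smaller batch shrinks the footprint proportionally.
print(f"float32, batch of 32: {batch_bytes(32, 300, 300, 3, FLOAT32_BYTES) / 2**20:.1f} MiB")
```

The same proportionality is why reducing `batch_size` is usually the first thing to try against an OOM error.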

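In addition, we can call gc.collect() after each epoch to collect garbage, which frees up some memory; see below. It also helps to del large unused variables that may be consuming a considerable amount of memory.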

import tensorflow as tf
import gc

class RemoveGarbaseCallback(tf.keras.callbacks.Callback):
    # Run Python's garbage collector at the end of every epoch
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()
...
...
model.fit(train_generator, ...
          callbacks=[RemoveGarbaseCallback()])

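When using tf.keras, we can also call clear_session(), which resets all state generated by Keras. This is recommended when models are created inside a loop, so we can use the following snippet in each iteration.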

for _ in range(no_of_iteration):
    # With `clear_session()` called at the beginning,
    # Keras starts with a blank state at each iteration
    # and memory consumption is constant over time.
    tf.keras.backend.clear_session()  # Resets all state generated by Keras

    train_generator = ...
    valid_generator = ...

    model = create_model()
    history = model.fit(..., callbacks=[RemoveGarbaseCallback()])

    # free up some memory space
    del model
    del train_generator, valid_generator

Update

As you reported:

UnidentifiedImageError: 
cannot identify image file <_io.BytesIO object at 0x0000019F9BC1E950>

This happens when the training directory contains unsupported files. To check the file formats, run the following function:

from collections import Counter
import os

def IMG_EXTENTION(img_path):
    # Count the extensions of all entries in img_path.
    # Entries without a "." (e.g. sub-folders) are counted as ''.
    extension_type = []
    for file in os.listdir(img_path):
        extension_type.append(file.rsplit(".", 1)[1].lower() if "." in file else "")

    counts = Counter(extension_type)
    print(counts.keys())
    print(counts.values())

train_dir = './images'  # directory that contains training samples
IMG_EXTENTION(img_path=train_dir)

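As expected, the directory should contain only image formats such as jpg, jpeg, png, etc. The catch with working in a Jupyter environment is that it automatically saves .ipynb checkpoints. In your case, they may have been saved into the training directory alongside the image files, and those are not supported. All you need to do is change the project directory or change the location where the checkpoints are saved.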
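As a complement to the extension count, the following sketch walks the training directory recursively and lists every file that does not look like an image, including anything inside a hidden `.ipynb_checkpoints` folder. The function name `find_non_image_files` and the `IMAGE_EXTENSIONS` set are assumptions for illustration; adjust the set to whatever your loader accepts. An extension check will not catch a corrupt file that carries a valid image extension.

```python
import os

# Assumed set of acceptable extensions; adjust as needed.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp', '.gif'}


def find_non_image_files(root):
    """Return sorted paths under root whose extension is not in IMAGE_EXTENSIONS."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext not in IMAGE_EXTENSIONS:
                offenders.append(os.path.join(dirpath, name))
    return sorted(offenders)


# Usage:
# for path in find_non_image_files('./images'):
#     print(path)
```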

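If you are using a custom data generator, I'd suggest using try and except to skip unsupported files. Also, with flow_from_dataframe instead of flow_from_directory, we can explicitly pass x_col="id" and y_col="label", in which case we may not run into this problem.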
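The try/except idea for a custom generator could look roughly like this. It is a minimal sketch, not the answerer's code: `iter_loadable` and its `load_fn` parameter are hypothetical names, and the loader is passed in so the example does not depend on a particular image library. Catching `OSError` also covers Pillow's `UnidentifiedImageError`, which is a subclass of it.

```python
def iter_loadable(paths, load_fn):
    """Yield (path, loaded_object) pairs, skipping files load_fn cannot read."""
    for path in paths:
        try:
            loaded = load_fn(path)
        except OSError as err:
            # Unsupported or unreadable file: report it and move on.
            print(f"skipping {path}: {err}")
            continue
        yield path, loaded


# Usage with Pillow (assumes Pillow is installed):
# from PIL import Image
# for path, img in iter_loadable(image_paths, Image.open):
#     ...
```

The paths that survive this filter could also be collected into a DataFrame with "id" and "label" columns and passed to flow_from_dataframe.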
