How to solve "OOM when allocating tensor with shape[XXX]" in tensorflow (when training a GCN)

So... I've looked through several posts about this problem (there are probably many more I haven't checked, but I think it's reasonable to ask for help at this point), and I still haven't found a solution that fits my situation.

This OOM error message always appears (without a single exception) in the second round of any fold's training loop, and again when the training code is re-run after the first [=45=]. So it may be related to the problem in this post, but I'm not sure which function is causing my issue.

My NN is a GCN with two graph convolution layers, and I run the code on a server with several 10 GB Nvidia P102-100 GPUs. I have already set batch_size to 1, but nothing changed. I'm also using a Jupyter Notebook rather than running the python script from the command line, because on the command line I can't even run a single round... By the way, does anyone know why some code that throws an OOM on the command line can run without problems in Jupyter? It seems a bit strange to me.

Update: after replacing Flatten() with GlobalMaxPool(), the error disappeared and I can run the code smoothly. However, if I add one more GC layer, the error shows up in the first round. So I guess the core problem is still there...
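For reference, the working variant swaps the Flatten/Dense head for global pooling; a rough sketch (same layer arguments and variables as in the full code below):

from spektral.layers import GraphConv, GlobalMaxPool

# The (13129, 32) node features are pooled to a single 32-dim vector per sample,
# so the following Dense layer only needs a 32 x 512 weight matrix
# instead of the 420128 x 512 one produced after Flatten().
graph_conv = GraphConv(32, activation='elu', kernel_regularizer=l2(l2_reg), use_bias=True)([X_in, A_in])
graph_conv = GraphConv(32, activation='elu', kernel_regularizer=l2(l2_reg), use_bias=True)([graph_conv, A_in])
pool = GlobalMaxPool()(graph_conv)
fc = Dense(512, activation='relu')(pool)
output = Dense(n_out, activation='sigmoid')(fc)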

更新 2:尝试用 tf.SparseTensor 替换 tf.Tensor。成功但没有用。也尝试按照 ML_Engine 的回答中提到的设置镜像策略,但看起来其中一个 GPU 占用最高,并且 OOM 仍然出现。也许它是一种“数据并行”并且无法解决我的问题,因为我已将 batch_size 设置为 1?
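The sparse variant I tried looks roughly like this (a sketch; self_connection_normalized_adjacency comes from the omitted preprocessing, and sp_matrix_to_sp_tensor is the spektral helper already imported in the code below):

# Keep the 13129 x 13129 filter as a tf.SparseTensor instead of densifying it
# with .toarray(); this only shrinks the adjacency input, not the
# Flatten -> Dense weight matrix that the OOM message points at.
fltr = self_connection_normalized_adjacency(adj)
A_in = Input(tensor=sp_matrix_to_sp_tensor(fltr))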

Code (adapted from GCNG):

from keras import Input, Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from keras.regularizers import l2
import tensorflow as tf
#from spektral.datasets import mnist
from spektral.layers import GraphConv
from spektral.layers.ops import sp_matrix_to_sp_tensor
from spektral.utils import normalized_laplacian
from keras.utils import plot_model
from sklearn import metrics
import numpy as np
import os  # used below for save_dir handling
import gc

l2_reg = 5e-7  # Regularization rate for l2
learning_rate = 1*1e-6  # Learning rate for SGD
batch_size = 1  # Batch size
epochs = 1 # Number of training epochs
es_patience = 50  # Patience for early stopping

# DATA IMPORTING & PREPROCESSING OMITTED

# this part of adjacency matrix calculation is not important...
fltr = self_connection_normalized_adjacency(adj)
test = fltr.toarray()
t = tf.convert_to_tensor(test)
A_in = Input(tensor=t)
del fltr, test, t
gc.collect()


# Here comes the issue.

for test_indel in range(1,11):

    # SEVERAL LINES OMITTED (get X_train, y_train, X_val, y_val, X_test, y_test)
    
    # Build model
    N = X_train.shape[-2]  # Number of nodes in the graphs
    F = X_train.shape[-1]  # Node features dimensionality
    n_out = y_train.shape[-1]  # Dimension of the target
    X_in = Input(shape=(N, F))
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([X_in, A_in])
    graph_conv = GraphConv(32,activation='elu',kernel_regularizer=l2(l2_reg),use_bias=True)([graph_conv, A_in])
    flatten = Flatten()(graph_conv)
    fc = Dense(512, activation='relu')(flatten)
    output = Dense(n_out, activation='sigmoid')(fc)
    model = Model(inputs=[X_in, A_in], outputs=output)
    optimizer = Adam(lr=learning_rate)
    model.compile(optimizer=optimizer,loss='binary_crossentropy',metrics=['acc'])
    model.summary()

    save_dir = current_path+'/'+str(test_indel)+'_self_connection_Ycv_LR_as_nega_rg_5-7_lr_1-6_e'+str(epochs)
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    early_stopping = EarlyStopping(monitor='val_acc', patience=es_patience, verbose=0, mode='auto')
    checkpoint1 = ModelCheckpoint(filepath=save_dir + '/weights.{epoch:02d}-{val_loss:.2f}.hdf5', monitor='val_loss',verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)
    checkpoint2 = ModelCheckpoint(filepath=save_dir + '/weights.hdf5', monitor='val_acc', verbose=1,save_best_only=True, mode='auto', period=1)
    callbacks = [checkpoint2, early_stopping]

    # Train model
    validation_data = (X_val, y_val)
    print('batch size = '+str(batch_size))
    history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)

    # Prediction and write-file code omitted
    del X_in, X_data_train,Y_data_train,gene_pair_index_train,count_setx_train,X_data_test, Y_data_test,gene_pair_index_test,trainX_index,validation_index,train_index, X_train, y_train, X_val, y_val, X_test, y_test, validation_data, graph_conv, flatten, fc, output, model, optimizer, history 
    gc.collect()

Model summary:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_2 (InputLayer)            (None, 13129, 2)     0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            (13129, 13129)       0                                            
__________________________________________________________________________________________________
graph_conv_1 (GraphConv)        (None, 13129, 32)    96          input_2[0][0]                    
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
graph_conv_2 (GraphConv)        (None, 13129, 32)    1056        graph_conv_1[0][0]               
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 420128)       0           graph_conv_2[0][0]               
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 512)          215106048   flatten_1[0][0]                  
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1)            513         dense_1[0][0]                    
==================================================================================================
Total params: 215,107,713
Trainable params: 215,107,713
Non-trainable params: 0
__________________________________________________________________________________________________
batch size = 1

Error message (note that it never appears in the first round after restarting the kernel and clearing the output):

Train on 2953 samples, validate on 739 samples
Epoch 1/1
---------------------------------------------------------------------------
ResourceExhaustedError                    Traceback (most recent call last)
<ipython-input-5-943385df49dc> in <module>()
     62     mem = psutil.virtual_memory()
     63     print("current mem " + str(round(mem.percent))+'%')
---> 64     history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)
     65     mem = psutil.virtual_memory()
     66     print("current mem " + str(round(mem.percent))+'%')

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
   1237                                         steps_per_epoch=steps_per_epoch,
   1238                                         validation_steps=validation_steps,
-> 1239                                         validation_freq=validation_freq)
   1240 
   1241     def evaluate(self,

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)
    194                     ins_batch[i] = ins_batch[i].toarray()
    195 
--> 196                 outs = fit_function(ins_batch)
    197                 outs = to_list(outs)
    198                 for l, o in zip(out_labels, outs):

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
   3290 
   3291     fetched = self._callable_fn(*array_vals,
-> 3292                                 run_metadata=self.run_metadata)
   3293     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3294     output_structure = nest.pack_sequence_as(

/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
   1456         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1457                                                self._handle, args,
-> 1458                                                run_metadata_ptr)
   1459         if run_metadata:
   1460           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[metrics_1/acc/Identity/_323]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
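For what it's worth, the shape in the OOM message matches the weight matrix of dense_1 (and Adam keeps extra moment tensors of the same shape):

    13129 nodes x 32 channels          = 420,128 flattened features
    420,128 x 512 weights + 512 biases = 215,106,048 parameters in dense_1

so the bulk of the memory goes into the Flatten -> Dense(512) connection rather than into the graph convolutions themselves.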

You can use distribution strategies in tensorflow to make sure that your multi-GPU setup is being used appropriately:

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    for test_indel in range(1,11):
         <etc>

See the documentation here.

Mirrored strategy is used for synchronous distributed training across multiple GPUs on a single server, which sounds like the setup you're using. There is also a more intuitive explanation in this blog.
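A slightly fuller sketch of the usual pattern (assuming tf.keras; build_model is just a placeholder for the model construction in your loop):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Everything that creates variables (layers, model, optimizer) must be
    # built inside the scope so the variables are mirrored on each GPU.
    model = build_model()  # placeholder for your model construction
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# The global batch is split across the replicas at each step, so with a
# global batch size of 1 there is nothing left to divide between GPUs.
model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs)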

Additionally, you could try using mixed precision, which should free up memory significantly by changing the float type of the parameters in the model.
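A minimal sketch of enabling it (assuming tf.keras and TF 2.4+; in earlier 2.x versions the same API lives under tf.keras.mixed_precision.experimental):

import tensorflow as tf

# Compute in float16 while keeping variables in float32; this roughly halves
# activation/gradient memory on GPUs that support it.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Build the model as usual, but keep the final layer in float32 so the
# sigmoid and the loss are computed at full precision, e.g.:
# output = Dense(n_out, activation='sigmoid', dtype='float32')(fc)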