How to solve "OOM when allocating tensor with shape[XXX]" in tensorflow (when training a GCN)
So... I have looked through several posts on this issue (there are probably many more I haven't checked, but I think it's reasonable to ask for help at this point), and I haven't found a solution that fits my situation.

This OOM error message always appears (without a single exception) in the second round of any fold of the training loop, and also whenever I re-run the training code again after the first round finishes. So it may be an issue related to this post, but I'm not sure which function my problem lies in.

My NN is a GCN with two graph convolutional layers, and I run the code on a server with several 10 GB Nvidia P102-100 GPUs. I have set batch_size to 1, but nothing changed. I'm also running the code in a Jupyter Notebook rather than as a Python script from the command line, because on the command line I can't even get through a single round... By the way, does anyone know why some code that runs without problems in Jupyter will throw OOM when run from the command line? It seems a bit strange to me.

Update: after replacing Flatten() with GlobalMaxPool(), the error disappeared and I can run the code smoothly. However, if I add one more GC layer, the error shows up in the first round. So I suppose the core issue is still there...
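For reference, the change was roughly this (just a sketch of what I mean, assuming spektral's GlobalMaxPool layer, which pools over the node dimension instead of flattening it):

from spektral.layers import GlobalMaxPool

# before: flatten = Flatten()(graph_conv)   # -> (batch, 13129 * 32) = 420128 features
pool = GlobalMaxPool()(graph_conv)           # -> (batch, 32)
fc = Dense(512, activation='relu')(pool)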
Update 2: I tried replacing the tf.Tensor with a tf.SparseTensor. It succeeded but didn't help. I also tried setting up a mirrored strategy as mentioned in ML_Engine's answer, but it looks like one of the GPUs takes the highest load and the OOM still shows up. Maybe it's a kind of "data parallelism" that can't solve my problem, since I have already set batch_size to 1?
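Roughly the sparse variant I tried (a sketch; it relies on spektral's sp_matrix_to_sp_tensor helper, already imported in the code below, and the exact Input arguments may differ by Keras version):

fltr = self_connection_normalized_adjacency(adj)
t = sp_matrix_to_sp_tensor(fltr)    # tf.SparseTensor instead of a dense tf.Tensor
A_in = Input(tensor=t, sparse=True)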
Code (adapted from GCNG):
from keras import Input, Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from keras.regularizers import l2
import tensorflow as tf
#from spektral.datasets import mnist
from spektral.layers import GraphConv
from spektral.layers.ops import sp_matrix_to_sp_tensor
from spektral.utils import normalized_laplacian
from keras.utils import plot_model
from sklearn import metrics
import numpy as np
import os
import gc

l2_reg = 5e-7          # Regularization rate for l2
learning_rate = 1e-6   # Learning rate for Adam
batch_size = 1         # Batch size
epochs = 1             # Number of training epochs
es_patience = 50       # Patience for early stopping

# DATA IMPORTING & PREPROCESSING OMITTED
# this part of the adjacency matrix calculation is not important...
fltr = self_connection_normalized_adjacency(adj)
test = fltr.toarray()
t = tf.convert_to_tensor(test)
A_in = Input(tensor=t)
del fltr, test, t
gc.collect()

# Here comes the issue.
for test_indel in range(1, 11):
    # SEVERAL LINES OMITTED (get X_train, y_train, X_val, y_val, X_test, y_test)

    # Build model
    N = X_train.shape[-2]      # Number of nodes in the graphs
    F = X_train.shape[-1]      # Node features dimensionality
    n_out = y_train.shape[-1]  # Dimension of the target
    X_in = Input(shape=(N, F))
    graph_conv = GraphConv(32, activation='elu', kernel_regularizer=l2(l2_reg), use_bias=True)([X_in, A_in])
    graph_conv = GraphConv(32, activation='elu', kernel_regularizer=l2(l2_reg), use_bias=True)([graph_conv, A_in])
    flatten = Flatten()(graph_conv)
    fc = Dense(512, activation='relu')(flatten)
    output = Dense(n_out, activation='sigmoid')(fc)
    model = Model(inputs=[X_in, A_in], outputs=output)
    optimizer = Adam(lr=learning_rate)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['acc'])
    model.summary()

    save_dir = current_path + '/' + str(test_indel) + '_self_connection_Ycv_LR_as_nega_rg_5-7_lr_1-6_e' + str(epochs)
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    early_stopping = EarlyStopping(monitor='val_acc', patience=es_patience, verbose=0, mode='auto')
    checkpoint1 = ModelCheckpoint(filepath=save_dir + '/weights.{epoch:02d}-{val_loss:.2f}.hdf5', monitor='val_loss', verbose=1, save_best_only=False, save_weights_only=False, mode='auto', period=1)
    checkpoint2 = ModelCheckpoint(filepath=save_dir + '/weights.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='auto', period=1)
    callbacks = [checkpoint2, early_stopping]

    # Train model
    validation_data = (X_val, y_val)
    print('batch size = ' + str(batch_size))
    history = model.fit(X_train, y_train, batch_size=batch_size, validation_data=validation_data, epochs=epochs, callbacks=callbacks)

    # Prediction and write-file code omitted
    del X_in, X_data_train, Y_data_train, gene_pair_index_train, count_setx_train, X_data_test, Y_data_test, gene_pair_index_test, trainX_index, validation_index, train_index, X_train, y_train, X_val, y_val, X_test, y_test, validation_data, graph_conv, flatten, fc, output, model, optimizer, history
    gc.collect()
Model summary:
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_2 (InputLayer) (None, 13129, 2) 0
__________________________________________________________________________________________________
input_1 (InputLayer) (13129, 13129) 0
__________________________________________________________________________________________________
graph_conv_1 (GraphConv) (None, 13129, 32) 96 input_2[0][0]
input_1[0][0]
__________________________________________________________________________________________________
graph_conv_2 (GraphConv) (None, 13129, 32) 1056 graph_conv_1[0][0]
input_1[0][0]
__________________________________________________________________________________________________
flatten_1 (Flatten) (None, 420128) 0 graph_conv_2[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 512) 215106048 flatten_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense) (None, 1) 513 dense_1[0][0]
==================================================================================================
Total params: 215,107,713
Trainable params: 215,107,713
Non-trainable params: 0
__________________________________________________________________________________________________
batch size = 1
Error message (note that it never appears in the first round right after restarting and clearing the outputs):
Train on 2953 samples, validate on 739 samples
Epoch 1/1
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-5-943385df49dc> in <module>()
62 mem = psutil.virtual_memory()
63 print("current mem " + str(round(mem.percent))+'%')
---> 64 history = model.fit(X_train,y_train,batch_size=batch_size,validation_data=validation_data,epochs=epochs,callbacks=callbacks)
65 mem = psutil.virtual_memory()
66 print("current mem " + str(round(mem.percent))+'%')
/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
1237 steps_per_epoch=steps_per_epoch,
1238 validation_steps=validation_steps,
-> 1239 validation_freq=validation_freq)
1240
1241 def evaluate(self,
/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/keras/engine/training_arrays.py in fit_loop(model, fit_function, fit_inputs, out_labels, batch_size, epochs, verbose, callbacks, val_function, val_inputs, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq)
194 ins_batch[i] = ins_batch[i].toarray()
195
--> 196 outs = fit_function(ins_batch)
197 outs = to_list(outs)
198 for l, o in zip(out_labels, outs):
/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/keras/backend.py in __call__(self, inputs)
3290
3291 fetched = self._callable_fn(*array_vals,
-> 3292 run_metadata=self.run_metadata)
3293 self._call_fetch_callbacks(fetched[-len(self._fetches):])
3294 output_structure = nest.pack_sequence_as(
/public/workspace/miniconda3/envs/ST/lib/python3.6/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
1456 ret = tf_session.TF_SessionRunCallable(self._session._session,
1457 self._handle, args,
-> 1458 run_metadata_ptr)
1459 if run_metadata:
1460 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[metrics_1/acc/Identity/_323]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[420128,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node training_1/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
You can use distribution strategies in tensorflow to make sure that your multi-GPU setup is being used appropriately:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    for test_indel in range(1, 11):
        <etc>
See the documentation here.
The mirrored strategy is used for synchronous distributed training across multiple GPUs on a single server, which sounds like the setup you're using. There is also a more intuitive explanation in this blog.
In addition, you could try using mixed precision, which should free up a significant amount of memory by altering the floating-point type of the parameters in the model.
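A rough sketch of enabling it with tf.keras (this assumes a TF 2.4+ mixed-precision API; older TF 2.x releases use tf.keras.mixed_precision.experimental.set_policy instead):

import tensorflow as tf

# Enable mixed precision globally before building the model (TF 2.4+ API).
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Layers then compute in float16 while variables stay in float32.
# Keep the final layer in float32 for numerical stability, e.g.:
# output = Dense(n_out, activation='sigmoid', dtype='float32')(fc)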