Tensorflow OOM on GPU
I'm training an LSTM-RNN in TensorFlow on some music data and ran into a GPU memory allocation problem I don't understand: I get an OOM even though there still seems to be enough VRAM available.
Some background:
I'm working on Ubuntu Gnome 16.04 with a GTX 1060 6GB, an Intel Xeon E3-1231v3 and 8GB of RAM.
First, here is the part of the error message that I can make sense of; I'll add the whole error message again at the end for anyone who might need it in order to help:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 5484118016
InUse: 4670717952
MaxInUse: 5484118016
NumAllocs: 29
MaxAllocSize: 1612612352
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000]
So what I read from this is that a maximum of 5484118016 bytes can be allocated, 4670717952 bytes are already in use, and another 775.72MB = 775720000 bytes are about to be allocated. According to my calculator, 5484118016 bytes - 4670717952 bytes - 775720000 bytes = 37680064 bytes.
So even after allocating space for the new tensor it wants to push in, there should still be about 37MB of free VRAM. That also seems plausible to me, because TensorFlow presumably (I guess?) wouldn't try to allocate more VRAM than is still available and would just keep the rest of the data in RAM or something.
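For reference, the allocator reports sizes in binary MiB, so a purely illustrative conversion of the numbers above (assuming 1 MiB = 1024 * 1024 bytes, which is how the BFC allocator's log rounds its chunk sizes) looks like this:

# Quick check of the numbers from the log, assuming "MiB" means binary mebibytes.
limit = 5484118016            # "Limit" from the allocator stats
in_use = 4670717952           # "InUse" from the allocator stats
requested_mib = 775.72        # size of the failed allocation, as logged

print(int(requested_mib * 1024 * 1024))   # roughly 813.4 million bytes, not 775,720,000
print(limit - in_use)                     # 813400064 bytes nominally still free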
Now I suppose there is some big mistake in my reasoning, but I'd be grateful if someone could explain to me what it is. The obvious way around my problem is to make my batches a bit smaller; at around 1.5GB each they are probably too big anyway. I'd still love to know what the actual problem is, though.
Edit: I found something that suggested trying:
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
with tf.Session(config=config) as s:
    ...  # rest of the training code unchanged
This still doesn't work, but since the TensorFlow documentation lacks any explanation of what
gpu_options.allocator_type = 'BFC'
actually does, I'd love to ask you guys.
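For what it's worth, the gpu_options settings that usually get tweaked for this kind of problem are allow_growth and per_process_gpu_memory_fraction; a minimal TF 1.x sketch (illustrative only, not the script from this post) would be:

import tensorflow as tf

# Sketch only: common TF 1.x GPU memory options.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grab VRAM on demand instead of reserving nearly all of it up front
# Alternatively, cap the fraction of total VRAM TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())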
Adding the rest of the error message for anyone interested:
Sorry for the long copy/paste, but maybe somebody will need/want to see it.
Thanks in advance,
Leon
(gputensorflow) leon@ljksUbuntu:~/Tensorflow$ python Netzwerk_v0.5.1_gamma.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1060 6GB
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.40GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728): Total Chunks: 1, Chunks in use: 0 147.20MiB allocated for chunks. 147.20MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456): Total Chunks: 1, Chunks in use: 0 628.52MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 775.72MiB was 256.00MiB, Chunk State:
I tensorflow/core/common_runtime/bfc_allocator.cc:666] Size: 628.52MiB | Requested Size: 0B | in_use: 0, prev: Size: 147.20MiB | Requested Size: 147.20MiB | in_use: 1, next: Size: 54.8KiB | Requested Size: 54.7KiB | in_use: 1
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000000 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208000600 of size 56064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1020800e200 of size 44288
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208018f00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10208019100 of size 813400064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387d1100 of size 56064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102387dec00 of size 154350080
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b11e00 of size 44288
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cb00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cc00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x10241b1cd00 of size 154350080
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102722d4d00 of size 56064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b615a00 of size 44288
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1027b620900 of size 813400064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102abdd8900 of size 813400064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc590900 of size 56064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc59e400 of size 56064
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102dc5abf00 of size 154350080
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102e58df100 of size 154350080
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec12300 of size 44288
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec1d000 of size 44288
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x102eec27d00 of size 1612612352
I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x1024ae4ff00 of size 659049984
I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x102722e2800 of size 154350080
I tensorflow/core/common_runtime/bfc_allocator.cc:693] Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 8 Chunks of size 256 totalling 2.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 44288 totalling 216.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 56064 totalling 273.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 154350080 totalling 588.80MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 813400064 totalling 2.27GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1612612352 totalling 1.50GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 4.35GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 5484118016
InUse: 4670717952
MaxInUse: 5484118016
NumAllocs: 29
MaxAllocSize: 1612612352
W tensorflow/core/common_runtime/bfc_allocator.cc:274] *********************___________*__***************************************************xxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 775.72MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Resource exhausted: OOM when allocating tensor with shape[14525,14000]
Traceback (most recent call last):
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000]
[[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Netzwerk_v0.5.1_gamma.py", line 171, in <module>
session.run(tf.global_variables_initializer())
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[14525,14000]
[[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]
Caused by op 'rnn/basic_lstm_cell/weights/Initializer/random_uniform', defined at:
File "Netzwerk_v0.5.1_gamma.py", line 94, in <module>
initial_state=initial_state, time_major=False) # time_major = FALSE currently
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 545, in dynamic_rnn
dtype=dtype)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 712, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2626, in while_loop
result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2459, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2409, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 697, in _time_step
(output, new_state) = call_cell()
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 683, in <lambda>
call_cell = lambda: cell(input_t, state)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 179, in __call__
concat = _linear([inputs, h], 4 * self._num_units, True, scope=scope)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 747, in _linear
"weights", [total_arg_size, output_size], dtype=dtype)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable
custom_getter=custom_getter)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable
custom_getter=custom_getter)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable
validate_shape=validate_shape)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter
caching_device=caching_device, validate_shape=validate_shape)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 684, in _get_single_variable
validate_shape=validate_shape)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__
expected_shape=expected_shape)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 303, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 673, in <lambda>
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/init_ops.py", line 360, in __call__
dtype, seed=self.seed)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/random_ops.py", line 246, in random_uniform
return math_ops.add(rnd * (maxval - minval), minval, name=name)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 73, in add
result = _op_def_lib.apply_op("Add", x=x, y=y, name=name)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/leon/anaconda3/envs/gputensorflow/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[14525,14000]
[[Node: rnn/basic_lstm_cell/weights/Initializer/random_uniform = Add[T=DT_FLOAT, _class=["loc:@rnn/basic_lstm_cell/weights"], _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/basic_lstm_cell/weights/Initializer/random_uniform/mul, rnn/basic_lstm_cell/weights/Initializer/random_uniform/min)]]
Try taking a look at this:
Be careful not to run the evaluation and training binary on the same
GPU or else you might run out of memory. Consider running the
evaluation on a separate GPU if available or suspending the training
binary while running the evaluation on the same GPU.
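One common way to keep an evaluation run off the training GPU (a sketch that assumes a second GPU is present; CUDA_VISIBLE_DEVICES is a standard CUDA environment variable, not something specific to this thread) is to restrict which devices each process can see before TensorFlow is imported:

import os

# Let this process see only the second GPU (index 1); must be set
# before TensorFlow initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # imported after setting the variable on purpose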
I solved this problem by reducing the batch size to 52.
The only way to reduce the memory usage here is to reduce the batch_size.
The batch_size you can use depends on your GPU, the size of its VRAM, cache memory, etc.
Please upvote this if it helps.
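As an illustration of what feeding smaller batches looks like in practice (a minimal sketch with placeholder names such as train_data and train_op, not code from this answer):

import numpy as np

batch_size = 52                                               # the reduced batch size mentioned above
train_data = np.random.rand(1000, 14000).astype(np.float32)  # placeholder data

# Iterate over the training set in small slices instead of one huge batch.
for start in range(0, len(train_data), batch_size):
    batch = train_data[start:start + batch_size]
    # sess.run(train_op, feed_dict={inputs: batch})  # feed each slice into the graph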
When hitting an OOM on the GPU, I think changing the batch size is the right thing to try first.
For a different GPU you may need a different batch size based on the GPU memory you have.
Recently I faced a similar problem and had to tweak a lot to run different kinds of experiments.
Here is the link (it also includes some tricks).
However, while reducing the batch size you may find that your training gets slower. So if you have multiple GPUs, you can use them. To check your GPUs, you can run this in a terminal:
nvidia-smi
It will show the necessary information about your GPUs.
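If nvidia-smi does show more than one GPU, a minimal TF 1.x sketch for pinning work onto a second device (assuming it is visible as /gpu:1) looks like this:

import tensorflow as tf

# Assumption: a second GPU is visible as /gpu:1 (check with nvidia-smi).
with tf.device('/gpu:1'):
    a = tf.random_uniform([1024, 1024])
    b = tf.matmul(a, a)

# allow_soft_placement falls back to another device if /gpu:1 is not available.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(tf.reduce_sum(b)))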
I ran into the same problem. I closed all Anaconda prompt windows and killed all Python tasks, then reopened an Anaconda prompt window and ran the train.py file. The next time it worked for me. The Anaconda and Python terminals were holding on to memory, leaving no space for the training process.
Also, if the above doesn't work, try reducing the batch size of the training process.
Hope this helps.
I recently got a very similar error, caused by accidentally having a training process running in the background while trying to train in a different process. Stopping one of them fixed the error immediately.
Had the same OOM issue when running model permutations one after the other. It seems that after finishing one model and then defining and running a new one, the GPU memory is not fully cleared of the previous model; something accumulates in memory and causes the eventual OOM error.
g-eoj's answer to another question:
keras.backend.clear_session()
should clear previous models. From https://keras.io/backend/:
Destroys the current TF graph and creates a new one. Useful to avoid clutter from old models / layers.
After running and saving one model, clear the session, then run the next model.
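A minimal sketch of that pattern (build_model and the file names are placeholders, not from the answer):

import keras

def build_model():
    # Placeholder architecture; substitute the real model here.
    model = keras.models.Sequential([keras.layers.Dense(10, input_shape=(100,))])
    model.compile(optimizer='adam', loss='mse')
    return model

for run in range(3):
    keras.backend.clear_session()   # destroy the previous TF graph before building the next model
    model = build_model()
    # model.fit(x_train, y_train, ...)   # train as usual
    model.save('model_run_%d.h5' % run)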
Use Ctrl + Shift + Esc to see the system usage. You are running out of memory. End the tasks you don't need. It worked for me.
I found the reason, and it was really silly. I wanted to check the architecture of the NN, so I had loaded TensorFlow in a terminal. Even though that loaded TensorFlow wasn't being used, it was still allocating resources. I closed the terminal and the OOM was gone.