TF 2: Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
I get the error above (Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR) when I run the code below. I verified that my GPU is being used with tf.test.is_gpu_available.
# coding: utf-8
import tensorflow as tf
import numpy as np
import keras
from models import *
import os
import gc
TF_FORCE_GPU_ALLOW_GROWTH = True
np.random.seed(1000)
#Paths
MODEL_CONF = "../models/conf/"
MODEL_WEIGHTS = "../models/weights/"
#Model informations
N_CLASSES = 3
def load_array(name):
    return np.load(name, allow_pickle=True)
gc.collect()
dirData = "saved_data/"
trainDir = dirData + "train/"
model = AdaptedLeNet((168, 168, 8), N_CLASSES)
model.summary(print_fn=lambda x: print(x + '\n'))
# Compile the model with the specified loss function.
model.compile(optimizer=keras.optimizers.Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
for filename in os.listdir(trainDir):
    data = load_array(trainDir + filename)
    train = data["a"]
    labels = data["b"].astype(int).reshape(-1)
    one_hot_targets = np.eye(N_CLASSES)[labels]
    model.fit(x=train, y=one_hot_targets, batch_size=32, epochs=5)
    gc.collect()
The output of this code is:
Epoch 1/5
2020-04-03 18:50:43.397010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-03 18:50:43.608330: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-03 18:50:44.274270: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-03 18:50:44.275686: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-04-03 18:50:44.275747: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
Traceback (most recent call last):
File "cnnAlert.py", line 62, in <module>
model.fit(x=train, y=one_hot_targets, batch_size=32, epochs=5)
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit
validation_freq=validation_freq)
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
outs = fit_function(ins_batch)
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3727, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1551, in __call__
return self._call_impl(args, kwargs)
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1591, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 545, in call
ctx=ctx)
File "/home/geodatin/env/py3GEE/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv2d_1/convolution (defined at /home/geodatin/env/py3GEE/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_2350]
Function call stack:
keras_scratch_graph
More information:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1660    Off  | 00000000:01:00.0  On |                  N/A |
| 27%   41C    P8     9W / 120W |    211MiB /  5911MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       989      G   /usr/lib/xorg/Xorg                            78MiB |
|    0      1438      G   cinnamon                                      31MiB |
|    0      8622      G   ...uest-channel-token=16736224539216711033    99MiB |
+-----------------------------------------------------------------------------+
How can I fix this error? Can you help me?
Edit 1
- CUDNN_VERSION from cudnn.h: 7605 (7.6.5)
- Host compiler version: GCC 7.5.0
- TensorFlow: 2.1.0-rc0
- The cuDNN library is on my LD_LIBRARY_PATH
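You may need to set the TensorFlow session option config.gpu_options.allow_growth to True. With the TF 1.x session API (available under tf.compat.v1 in TF 2.x), this can be done by adding the following at the top of your code:
gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))
# With standalone Keras on TF 1.x, use keras.backend.tensorflow_backend.set_session(sess) instead.
tf.compat.v1.keras.backend.set_session(sess)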
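Note that the script in the question assigns a Python variable named TF_FORCE_GPU_ALLOW_GROWTH, which has no effect on its own: TensorFlow reads this setting from the process environment. A minimal sketch of setting it as an environment variable follows. The helper name force_gpu_allow_growth is hypothetical, and the sketch assumes the variable only needs to be in place before TensorFlow first initializes the GPU, so setting it before the import is simply the safest ordering.

```python
import os
import sys
import warnings


def force_gpu_allow_growth():
    """Export TF_FORCE_GPU_ALLOW_GROWTH=true for the current process.

    Returns True if the variable was set before tensorflow was imported,
    False if tensorflow was already loaded (the setting may then be too late
    if the GPU has already been initialized).
    """
    set_before_import = "tensorflow" not in sys.modules
    if not set_before_import:
        warnings.warn(
            "tensorflow is already imported; the variable only takes effect "
            "if the GPU has not been initialized yet."
        )
    # TensorFlow reads this from the environment, not from a Python variable.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return set_before_import


set_early = force_gpu_allow_growth()
# Import TensorFlow only after this point, e.g.:
# import tensorflow as tf
```

The same effect can be had from the shell by exporting the variable before launching the script, which avoids any ordering concerns inside the Python code.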
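There is an answer on a question about TF1.0 that addresses how to do this for TF2. That answer's suggestion worked for me, so I am copying it here. TF2 seems to be moving away from tf.Session, so I prefer this suggestion over the other answers here.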
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
# set_memory_growth returns None, so there is nothing to assign.
tf.config.experimental.set_memory_growth(physical_devices[0], True)
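On a machine with more than one GPU, the same idea can be applied to every device. The following is a minimal sketch, not a definitive implementation: the function name enable_memory_growth is hypothetical, and it assumes the call happens before any GPU has been initialized, since TensorFlow raises a RuntimeError otherwise.

```python
import tensorflow as tf


def enable_memory_growth():
    """Turn on memory growth for every visible GPU.

    Returns the number of GPUs for which the setting was applied.
    """
    gpus = tf.config.experimental.list_physical_devices("GPU")
    enabled = 0
    for gpu in gpus:
        try:
            # Allocate GPU memory on demand instead of reserving it all up front.
            tf.config.experimental.set_memory_growth(gpu, True)
            enabled += 1
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print(f"Could not set memory growth for {gpu.name}: {err}")
    return enabled


n_configured = enable_memory_growth()
print(f"Memory growth enabled on {n_configured} GPU(s)")
```

Call this once at the top of the script, before building the model or running any op that touches the GPU.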