Blas GEMM launch failed:
I am trying to build a dense classifier on top of a pre-trained CNN model. A working GPU is configured and TensorFlow runs its operations on the GPU. My environment was not created with Anaconda; it has the following setup:
IDE - PyCharm, TF = 2.4.0, CUDA = 11.0
However, I cannot get any output because of
Blas GEMM launch failed : a.shape=(20, 8192), b.shape=(8192, 256), m=20, n=256, k=8192
It also shows
failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
I have checked the website and tried setting
configuration.gpu_options.allow_growth = True
but that did not help. I also tried the following code:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
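For reference, the allow_growth flag mentioned above is usually applied through the TF1-compat session config; this is only a minimal sketch, and the exact session wiring in my script is assumed:
import tensorflow as tf

# Rough sketch of the allow_growth workaround via the TF1-compat API
# (assumed setup; the way the session is attached may differ).
configuration = tf.compat.v1.ConfigProto()
configuration.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=configuration)
tf.compat.v1.keras.backend.set_session(session)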
But the output error is still the same. I also found a few other solutions in various Stack Overflow answers, and none of them worked either.
My code is as follows:
import os
import numpy as np
import shutil
import matplotlib.pyplot as plt
import tensorflow as tf
print(tf.test.gpu_device_name())
print(tf.__version__)
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16
conv_base = VGG16(weights = 'imagenet', include_top = False, input_shape = (150,150,3))
base_dir = 'C:/Users/emamu/Downloads/cat_and_dog_small'
train_dir = os.path.join(base_dir, 'train')
val_dir = os.path.join(base_dir, 'val')
test_dir = os.path.join(base_dir, 'test')
data_generator_unaugmented = ImageDataGenerator(1./255)
batch_size = 20
def extract_feature(directory, sample_count):
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))
    generator = data_generator_unaugmented.flow_from_directory(directory,
                                                               target_size = (150, 150),
                                                               batch_size = batch_size,
                                                               class_mode = 'binary')
    i = 0
    for input_batch, label_batch in generator:
        feature_batch = conv_base.predict(input_batch)
        features[i*batch_size:(i+1)*batch_size] = feature_batch
        labels[i*batch_size:(i+1)*batch_size] = label_batch
        i = i + 1
        if i*batch_size >= sample_count:
            break
    return features, labels
train_feature, train_label = extract_feature(train_dir, 2000)
val_feature, val_label = extract_feature(val_dir, 1000)
test_feature, test_label = extract_feature(test_dir, 1000)
train_feature = np.reshape(train_feature, (2000, 4*4*512))
val_feature = np.reshape(val_feature, (1000, 4*4*512))
test_feature = np.reshape(test_feature, (1000, 4*4*512))
# Creating a classifier model on top
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_dim=4*4*512))
model.add(layers.Dropout(0.4))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer=optimizers.RMSprop(lr=1e-5), loss='binary_crossentropy', metrics=['acc'])
history = model.fit(train_feature, train_label,
                    epochs = 30, batch_size = 20,
                    validation_data=(val_feature, val_label))
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc)+1)
plt.plot(epochs, acc, label = 'training accuracy')
plt.plot(epochs, val_acc, label = 'val accuracy')
plt.figure()
plt.show()
plt.plot(epochs, loss, label = 'training loss')
plt.plot(epochs, val_loss, label = 'val loss')
plt.figure()
plt.show()
Whatever I try, I get the following error:
C:\Users\emamu\PycharmProjects\gput_test_01\venv\Scripts\python.exe C:/Users/emamu/PycharmProjects/gput_test_01/cnn_transfer_learning.py
2021-05-06 12:00:03.592087: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:05.532458: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-06 12:00:05.534984: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-05-06 12:00:05.559708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 178.84GiB/s
2021-05-06 12:00:05.559871: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:05.568725: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:05.568814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-06 12:00:05.572032: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-06 12:00:05.573585: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-06 12:00:05.580959: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-06 12:00:05.583500: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-06 12:00:05.584151: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:05.584333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
/device:GPU:0
2.4.0
2021-05-06 12:00:06.133316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-06 12:00:06.133441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-05-06 12:00:06.133513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-05-06 12:00:06.133740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 2903 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-05-06 12:00:06.134445: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-06 12:00:06.145654: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-06 12:00:06.146140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 178.84GiB/s
2021-05-06 12:00:06.146343: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:06.146433: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:06.146522: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-06 12:00:06.146610: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-06 12:00:06.146699: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-06 12:00:06.146787: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-06 12:00:06.146880: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-06 12:00:06.146972: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:06.147095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-05-06 12:00:06.147534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 178.84GiB/s
2021-05-06 12:00:06.147732: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:06.147840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:06.147933: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-06 12:00:06.148021: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-06 12:00:06.148106: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-06 12:00:06.148193: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-06 12:00:06.148276: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-06 12:00:06.148359: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:06.148464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-05-06 12:00:06.148562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-06 12:00:06.148646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-05-06 12:00:06.148699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-05-06 12:00:06.148810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2903 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-05-06 12:00:06.148979: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Found 2000 images belonging to 2 classes.
C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\keras_preprocessing\image\image_data_generator.py:720: UserWarning: This ImageDataGenerator specifies `featurewise_center`, but it hasn't been fit on any training data. Fit it first by calling `.fit(numpy_data)`.
warnings.warn('This ImageDataGenerator specifies '
2021-05-06 12:00:06.750321: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-05-06 12:00:06.876918: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:08.024040: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0
2021-05-06 12:00:08.087066: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0
2021-05-06 12:00:08.222607: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:08.720199: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:09.177801: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.66GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-05-06 12:00:09.178227: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.66GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-05-06 12:00:09.257660: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:09.587331: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:10.196301: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:10.486611: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:11.055390: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:11.328883: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:11.828937: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:12.031743: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Found 1000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Epoch 1/30
2021-05-06 12:00:45.989644: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.990645: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.991702: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.992462: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.993468: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:46.000724: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:46.000869: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "C:/Users/emamu/PycharmProjects/gput_test_01/cnn_transfer_learning.py", line 92, in <module>
history = model.fit(train_feature, train_label,
File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
return self._stateless_fn(*args, **kwds)
File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
return graph_function._call_flat(
File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
outputs = execute.execute(
File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20, 8192), b.shape=(8192, 256), m=20, n=256, k=8192
[[node sequential/dense/MatMul (defined at /Users/emamu/PycharmProjects/gput_test_01/cnn_transfer_learning.py:92) ]] [Op:__inference_train_function_11890]
Function call stack:
train_function
Process finished with exit code 1
What is the actual cause of this error? Is there a permanent fix for this problem?
I have solved the problem. The procedure is the same as the tensorflow website states. However, if any older CUDA or cuDNN is installed, first remove all CUDA drivers from the PC and install Visual Studio. Then the most important part is choosing the correct CUDA and cuDNN versions. If the TF version is 2.4.x, choose CUDA 11.0.x and cuDNN 8.0.5 (or any cuDNN release that supports only CUDA 11.0.x).
For more details about the versions, check this page. Once again, the versions of tensorflow, CUDA and cuDNN are the most critical part of the tensorflow-gpu installation process.
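To double-check which CUDA and cuDNN versions a given TensorFlow wheel was built against, the build info can be printed; this is a minimal sketch, and the exact keys in the returned dict may vary between TF releases:
import tensorflow as tf

# Print the CUDA/cuDNN versions this TensorFlow build expects,
# plus the GPUs that are actually visible at runtime.
build = tf.sysconfig.get_build_info()
print("TF version:   ", tf.__version__)
print("CUDA version: ", build.get("cuda_version"))
print("cuDNN version:", build.get("cudnn_version"))
print("Visible GPUs: ", tf.config.list_physical_devices("GPU"))
If the printed CUDA/cuDNN versions do not match what is installed on the machine, that mismatch is usually the first thing to fix.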