Why doesn't the GPU device in use match the TensorFlow log?

My machine has 4 GPUs. At the very start of my code I set:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

nvidia-smi confirms that GPU 1 is in fact being used. However, the TensorFlow log in the terminal reports GPU 0:

2021-09-24 02:27:55.691073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:0d.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-09-24 02:27:55.691123: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-24 02:27:55.694585: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-09-24 02:27:55.698234: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-09-24 02:27:55.698776: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-09-24 02:27:55.702390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-09-24 02:27:55.703656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-09-24 02:27:55.709853: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-09-24 02:27:55.710078: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:55.711069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:55.711917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0

...

2021-09-24 02:27:55.906440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2021-09-24 02:27:55.906571: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-09-24 02:27:57.342555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-24 02:27:57.342608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-09-24 02:27:57.342619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 
2021-09-24 02:27:57.342980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:57.343982: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-24 02:27:57.344891: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14419 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:0d.0, compute capability: 7.0)

I have two questions:

  1. GPU 0 is indeed in use, but by another process; my code is running on GPU 1. Why is the log above inconsistent with the device that is actually being used?

  2. Also, TensorFlow 2 is supposed to detect an available GPU automatically and use it. If I do not add this line:

    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

the log shows it trying to use GPU 0, and it fails with an out-of-memory error.

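  1. The CUDA_VISIBLE_DEVICES environment variable remaps whichever devices you select so that, as far as your CUDA process is concerned, the devices in your list are numbered starting from zero. So when you do this: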

    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    

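    from then on CUDA sees that device as device 0. This is why the TensorFlow log reports device 0 while nvidia-smi shows physical GPU 1 in use.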

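  2. Just because a GPU is being used by another process or user does not mean it is "unavailable" to you. CUDA does not prevent two users or two processes from trying to use the same GPU, and in some cases sharing is sensible and effective. TF therefore treats GPU 0 as an available device, tries to use it, and runs out of memory. This is a typical reason people use the environment variable described in 1 above: it makes only certain devices "visible" or "available" to your TF process.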
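
The renumbering described in point 1 can be sketched in plain Python. This is only a simplified model for illustration, not CUDA's actual implementation: the function name logical_to_physical and its physical_count parameter are hypothetical, only integer indices are handled (CUDA also accepts device UUIDs), and it assumes the physical numbering matches what nvidia-smi shows.

```python
def logical_to_physical(cuda_visible_devices, physical_count):
    """Return the physical GPU index behind each logical CUDA device index.

    Hypothetical helper that models the remapping; it does not query CUDA.
    """
    if cuda_visible_devices is None:
        # Variable unset: every physical GPU is visible under its own index.
        return list(range(physical_count))
    mapping = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= physical_count:
            # Entries from the first invalid one onward are ignored.
            break
        mapping.append(int(token))
    return mapping


# Machine from the question: 4 physical GPUs, variable set to "1".
for logical, physical in enumerate(logical_to_physical("1", physical_count=4)):
    print(f"GPU:{logical} in the TensorFlow log -> physical GPU {physical}")
# prints: GPU:0 in the TensorFlow log -> physical GPU 1
```

With the variable set to "2,0", the same model gives logical device 0 as physical GPU 2 and logical device 1 as physical GPU 0. In a real script the variable has to be set before CUDA is initialized, which in practice means before TensorFlow is imported, as in the question.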