nvidia-smi 在未使用时显示 GPU 利用率

nvidia-smi shows GPU utilization when it's unused

我在 GPU id 1 上使用 export CUDA_VISIBLE_DEVICES=1 运行 tensorflow,nvidia-smi 中的一切看起来都不错,我的 python 进程在 gpu 1 上是 运行,内存和功耗显示 GPU 1 正在使用中。

但奇怪的是,未使用的 GPU 0(基于进程列表、内存、电源使用情况和常识)显示 96% 的易失性 GPU 利用率。

有人知道为什么吗?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:03:00.0     Off |                    0 |
| 30%   41C    P0    53W / 225W |      0MiB /  4742MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:43:00.0     Off |                    0 |
| 36%   49C    P0    95W / 225W |   4516MiB /  4742MiB |     63%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1      5193    C   python                                        4514MiB |
+-----------------------------------------------------------------------------+

运行 ps aux | grep 5193 查看哪个程序正在使用 GPU。

您的 GPU 启用了 ECC,因此您会看到高 CPU 或内存利用率。

During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings. This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.
When Persistence Mode is Disabled, driver deinitializes when there are no clients running (CUDA apps or nvidia-smi or XServer) and needs to initialize again before any GPU application (like nvidia-smi) can query its state thus causing ECC Scrubbing.
As a rule of thumb always run with Persistence Mode Enabled. Just run as root nvidia-smi -pm 1. This will speed up application lunching by keeping the driver always loaded.

参考:https://devtalk.nvidia.com/default/topic/539632/k20-with-high-utilization-but-no-compute-processes-/