nvidia-smi shows GPU utilization when it's unused
I'm running TensorFlow on GPU id 1 using export CUDA_VISIBLE_DEVICES=1.
Everything in nvidia-smi looks fine: my python process is running on GPU 1, and the memory and power figures show that GPU 1 is in use.
But strangely, GPU 0, which is unused (judging by the process list, memory usage, power draw, and common sense), shows 96% volatile GPU utilization.
Does anyone know why?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:03:00.0     Off |                    0 |
| 30%   41C    P0    53W / 225W |      0MiB /  4742MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:43:00.0     Off |                    0 |
| 36%   49C    P0    95W / 225W |   4516MiB /  4742MiB |     63%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1      5193    C   python                                        4514MiB |
+-----------------------------------------------------------------------------+
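For reference, the device masking described in the question can be sketched as follows (the script name is a hypothetical placeholder, not from the original post):

```shell
# Mask all GPUs except physical device 1 for this shell and its children;
# inside the process, CUDA renumbers the visible GPU as device 0.
export CUDA_VISIBLE_DEVICES=1
echo "$CUDA_VISIBLE_DEVICES"

# Then launch the TensorFlow job, e.g.:
# python train.py
```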
Run ps aux | grep 5193
to see which program is using the GPU.
Your GPUs have ECC enabled, which is why you see high GPU and memory utilization readings.
During driver initialization, when ECC is enabled, one can see high GPU and memory utilization readings. This is caused by the ECC memory scrubbing mechanism that runs during driver initialization.
When Persistence Mode is disabled, the driver deinitializes when no clients are running (CUDA apps, nvidia-smi, or an X server) and has to initialize again before any GPU application (such as nvidia-smi) can query its state, which triggers ECC scrubbing each time.
As a rule of thumb, always run with Persistence Mode enabled. Just run, as root: nvidia-smi -pm 1
This speeds up application launching by keeping the driver loaded at all times.
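A sketch of the commands above for a typical Linux box (the per-GPU and query variants are standard nvidia-smi options, but assume a driver recent enough to support them):

```shell
# Enable persistence mode on every GPU (requires root). The driver then
# stays loaded even with no clients, so ECC scrubbing happens once at
# driver load instead of before every nvidia-smi or CUDA-app start.
sudo nvidia-smi -pm 1

# Or limit the change to a single GPU by index:
sudo nvidia-smi -i 0 -pm 1

# Verify (the Persistence-M column should now read "On"):
nvidia-smi --query-gpu=index,persistence_mode --format=csv
```

Note that without a persistence daemon or init script, the setting does not survive a reboot and must be re-applied.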
参考:https://devtalk.nvidia.com/default/topic/539632/k20-with-high-utilization-but-no-compute-processes-/