CUDA MPS servers fail to start on workstation with multiple GPUs
Edit: I tried enumerating the valid GPUs by their UUIDs instead of their IDs, and that makes it work.
It seems it was still seeing the GT 610 even though I thought it shouldn't, and that is why it wasn't working.
I am having trouble using CUDA MPS on one of my machines.
The machine has 4 Tesla K80s, as well as a GT 610 which (edit:) does not support MPS.
Here is the nvidia-smi output:
riveale@coiworkstation1:~/code/psweep2/src$ nvidia-smi
Tue Mar 15 23:51:59 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 610 Off | 0000:01:00.0 N/A | N/A |
| 40% 29C P8 N/A / N/A | 3MiB / 1021MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:04:00.0 Off | 0 |
| N/A 29C P8 26W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 24C P8 30W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:08:00.0 Off | 0 |
| N/A 34C P8 27W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:09:00.0 Off | 0 |
| N/A 28C P8 29W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:84:00.0 Off | 0 |
| N/A 31C P8 28W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 26C P8 30W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:88:00.0 Off | 0 |
| N/A 31C P8 26W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 8 Tesla K80 Off | 0000:89:00.0 Off | 0 |
| N/A 25C P8 31W / 149W | 55MiB / 11519MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
As you can see, I have already set the processors to exclusive-process mode.
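(For reference, compute mode is set per device with nvidia-smi; a rough sketch of putting the K80s into EXCLUSIVE_PROCESS, assuming root access and the device indices from the listing above:)
# Put each Tesla K80 (indices 1-8 in the nvidia-smi listing) into
# EXCLUSIVE_PROCESS compute mode; the GT 610 at index 0 keeps its default.
for i in 1 2 3 4 5 6 7 8; do
    sudo nvidia-smi -i "$i" -c EXCLUSIVE_PROCESS
done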
I can run a sanity check using just the first GPU, starting the MPS server, etc., as follows:
export CUDA_VISIBLE_DEVICES="0"
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
Then I run my script:
NRANKS=4
mpirun -n $NRANKS gputest.exe
This runs successfully, and in /tmp/nvidia-log/server.log I see:
riveale@coiworkstation1:~/code/psweep2/src$ cat /tmp/nvidia-log/server.log
[2016-03-15 23:57:07.883 Other 6957] Start
[2016-03-15 23:57:08.513 Other 6957] New client 6956 connected
[2016-03-15 23:57:08.513 Other 6957] New client 6954 connected
[2016-03-15 23:57:08.514 Other 6957] New client 6955 connected
However, when I try to use more than one GPU on the system, I run into problems. Specifically, I do the following (exactly the same as before, but now with two visible CUDA devices):
export CUDA_VISIBLE_DEVICES="0,1"
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
(ps ax | grep mps shows the daemon starts up fine, no different from the working example above.)
Immediately followed by:
NRANKS=7
mpirun -n $NRANKS gputest.exe
I get:
riveale@coiworkstation1:~/code/psweep2/src$ cat /tmp/nvidia-log/server.log
[2016-03-15 23:59:55.718 Other 7102] Start
[2016-03-15 23:59:56.301 Other 7102] MPS server failed to start
[2016-03-15 23:59:56.301 Other 7102] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:56.727 Other 7105] Start
[2016-03-15 23:59:57.302 Other 7105] MPS server failed to start
[2016-03-15 23:59:57.302 Other 7105] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:57.718 Other 7107] Start
[2016-03-15 23:59:58.291 Other 7107] MPS server failed to start
[2016-03-15 23:59:58.291 Other 7107] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:58.709 Other 7109] Start
[2016-03-15 23:59:59.236 Other 7109] MPS server failed to start
[2016-03-15 23:59:59.236 Other 7109] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-15 23:59:59.644 Other 7111] Start
[2016-03-16 00:00:00.215 Other 7111] MPS server failed to start
[2016-03-16 00:00:00.215 Other 7111] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
[2016-03-16 00:00:00.651 Other 7113] Start
[2016-03-16 00:00:01.221 Other 7113] MPS server failed to start
[2016-03-16 00:00:01.221 Other 7113] MPS is only supported on 64-bit Linux platforms, with an SM 3.5 or higher GPU.
Strange.
Thanks in advance for any help/ideas.
Another odd thing is that exactly the same setup works on my other workstation, which has the same configuration except that it has a Quadro K620 instead of the GT 610. The K620 is a CUDA device (compute capability 5.0), so I suspect that is where the problem lies. I am remote right now, so I cannot swap the cards to see whether that changes anything.
As flagged in the edit, the solution was to take the UUIDs of the GPUs with compute capability >= 3.5 and set CUDA_VISIBLE_DEVICES to those. It seems that, for whatever reason, even though device 0 correctly came up as one of the K80s, CUDA was listing the display device (the 610, etc.) as device #1 rather than as the last device, as I had expected.
For example:
riveale@coiworkstation0:~$ nvidia-smi -L
GPU 0: Quadro K620 (UUID: GPU-1685f2e0-0f3a-fef1-c94c-00bf21afeb24)
GPU 1: Tesla K80 (UUID: GPU-9e8b10fb-8005-24c7-b7aa-5795c39b4c15)
GPU 2: Tesla K80 (UUID: GPU-3d917409-02ae-079b-3941-bacd9570b8c6)
GPU 3: Tesla K80 (UUID: GPU-8faf997f-67a1-b729-6205-1da501a39470)
GPU 4: Tesla K80 (UUID: GPU-99da7098-9e60-d67a-c5c8-de52e4b30c30)
riveale@coiworkstation0:~$ export CUDA_VISIBLE_DEVICES="GPU-9e8b10fb-8005-24c7-b7aa-5795c39b4c15,GPU-3d917409-02ae-079b-3941-bacd9570b8c6,GPU-8faf997f-67a1-b729-6205-1da501a39470,GPU-99da7098-9e60-d67a-c5c8-de52e4b30c30"
I have to do this on each node/machine before launching the nvidia-cuda-mps-control -d script above.
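Since typing the UUIDs by hand on every node gets tedious, the same selection can be scripted; a minimal sketch, assuming the only MPS-capable cards are the Teslas (the grep pattern and the UUIDS variable are mine, not from the original commands):
# Build CUDA_VISIBLE_DEVICES from the UUIDs of the Tesla cards only,
# skipping the display GPU (GT 610 / Quadro K620), then start the MPS daemon.
UUIDS=$(nvidia-smi --query-gpu=name,uuid --format=csv,noheader \
        | grep Tesla | awk -F', ' '{print $2}' | paste -sd, -)
export CUDA_VISIBLE_DEVICES="$UUIDS"
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
(Setting CUDA_DEVICE_ORDER=PCI_BUS_ID should also make the numeric device IDs match the nvidia-smi ordering, so the display card could be excluded by index instead, but the UUID list is what is confirmed to work here.)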
It turned out that MPS was slow anyway (the MPS server was eating a lot of CPU), so I decided not to use it.
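For completeness, when abandoning MPS the control daemon can be shut down cleanly rather than killed; a minimal sketch:
# Ask the MPS control daemon to shut down, then check that no MPS processes remain.
echo quit | nvidia-cuda-mps-control
ps ax | grep "[m]ps"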