NVidia drivers stopped working on AWS EC2 instance with Ubuntu 16.04 and Tesla K80 GPU
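I have been running TensorFlow code on an AWS EC2 instance with a Tesla K80 GPU.
I installed CUDA 9.0 and cuDNN 7.1.4, and I am using TF 1.12, all on Ubuntu 16.04.
Everything worked fine until yesterday, but today the NVidia drivers seem to have stopped running for some reason: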
ubuntu@ip-10-0-0-13:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I checked the drivers:
ubuntu@ip-10-0-0-13:~$ dpkg -l | grep nvidia
rc nvidia-367 367.48-0ubuntu1 amd64 NVIDIA binary driver - version 367.48
ii nvidia-396 396.37-0ubuntu1 amd64 NVIDIA binary driver - version 396.37
ii nvidia-396-dev 396.37-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-machine-learning-repo-ubuntu1604 1.0.0-1 amd64 nvidia-machine-learning repository configuration files
ii nvidia-modprobe 396.37-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
rc nvidia-opencl-icd-367 367.48-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-396 396.37-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.2 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 396.37-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
There seem to be 2 different versions. Could that be a problem? (But I don't understand why everything worked fine before.)
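On the question of two versions: in `dpkg -l` output the first column is the package state. `ii` means the package is installed, while `rc` means it was removed and only its configuration files remain, so only the 396 driver is actually installed in the listing above. A minimal sketch for filtering that output is shown here; the function name `installed_nvidia_drivers` is hypothetical and not part of dpkg.

```shell
#!/bin/bash
# Print only the versioned NVIDIA driver packages that dpkg reports as
# fully installed ("ii"). Reads `dpkg -l` output on stdin.
# Packages in state "rc" (removed, config files left behind) are skipped.
installed_nvidia_drivers() {
    awk '$1 == "ii" && $2 ~ /^nvidia-[0-9]+$/ { print $2 }'
}

# Usage: dpkg -l | installed_nvidia_drivers
```

If this prints more than one package name, two drivers really are installed side by side and one of them would need to be removed.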
I found this thread and checked my kernel, which is clearly different from the one mentioned in the thread:
ubuntu@ip-10-0-0-13:~$ uname -a
Linux ip-10-0-0-13 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
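Has anyone run into this problem and knows how to solve it?
Thanks in advance for your help!
EDIT:
When trying to upgrade the drivers using @Dehydrated_Mud's method, I got the following error: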
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
And the content of the log file:
nvidia-installer log file '/var/log/nvidia-installer.log'
creation time: Thu Mar 21 10:56:46 2019
installer version: 384.183
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
nvidia-installer command line:
./nvidia-installer
--no-drm
--disable-nouveau
--dkms
--silent
--install-libglvnd
Using built-in stream user interface
-> Detected 4 CPUs online; setting concurrency level to 4.
-> Installing NVIDIA driver version 384.183.
-> The NVIDIA driver appears to have been installed previously using a different installer. To prevent potential conflicts, it is recommended either to update the existing installation using the same mechanism by which it was originally installed, or to uninstall the existing installation before installing this driver.
Please review the message provided by the maintainer of this alternate installation method and decide how to proceed:
The package that is already installed is named nvidia-396.
You can upgrade the driver by running:
`apt-get install nvidia-396 nvidia-modprobe nvidia-settings`
You can remove nvidia-396, and all related packages, by running:
`apt-get remove --purge nvidia-396 nvidia-modprobe nvidia-settings`
This package is maintained by NVIDIA (cudatools@nvidia.com).
(Answer: Abort installation)
ERROR: The installation was canceled due to the availability or presence of an alternate driver installation. Please see /var/log/nvidia-installer.log for more details.
Running apt-cache search nvidia | grep -P '^nvidia-[0-9]+\s'
gives:
nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-304 - NVIDIA legacy binary driver - version 304.135
nvidia-340 - NVIDIA binary driver - version 340.107
nvidia-361 - Transitional package for nvidia-367
nvidia-352 - Transitional package for nvidia-375
nvidia-367 - Transitional package for nvidia-387
nvidia-375 - Transitional package for nvidia-418
nvidia-387 - NVIDIA binary driver - version 387.26
nvidia-418 - NVIDIA binary driver - version 418.39
nvidia-384 - NVIDIA binary driver - version 384.183
nvidia-390 - NVIDIA binary driver - version 390.116
nvidia-410 - NVIDIA binary driver - version 410.104
nvidia-396 - NVIDIA binary driver - version 396.82
I solved this by updating to the latest Nvidia driver. Use:
nvcc --version
to get the CUDA toolkit version number. For CUDA 9.0, the latest driver is 384.183; for CUDA 10.0, the latest driver is 410.104.
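The lookup described above can be sketched as two small shell functions. This assumes `nvcc --version` prints a line of the form `Cuda compilation tools, release 9.0, V9.0.176`; the function names `cuda_release` and `driver_for_cuda` are hypothetical, and the table only contains the two pairs quoted in this answer.

```shell
#!/bin/bash
# Extract the toolkit release (e.g. "9.0") from `nvcc --version` output on stdin.
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Map a CUDA release to the driver version quoted in this answer.
# Anything else is reported as unknown rather than guessed.
driver_for_cuda() {
    case "$1" in
        9.0)  echo "384.183" ;;
        10.0) echo "410.104" ;;
        *)    echo "no driver listed here for CUDA $1" >&2; return 1 ;;
    esac
}

# Usage: driver_for_cuda "$(nvcc --version | cuda_release)"
```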
Then run:
wget http://us.download.nvidia.com/tesla/384.183/NVIDIA-Linux-x86_64-384.183.run
to download the driver.
Then run:
sudo sh ./NVIDIA-Linux-x86_64-384.183.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
to install the driver.
Run:
nvidia-smi
to check whether the problem is solved.
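#!/bin/bash
# Usage: sh install.sh <driver version>, e.g. sh install.sh 410.104
set -x
# Take the driver version from the first argument.
version="$1"
#version=410.79
#version=410.104
# Download the Tesla driver installer, then install it with DKMS support.
wget http://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run
sudo sh ./NVIDIA-Linux-x86_64-${version}.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd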
Save the above as install.sh, then run:
sh install.sh 410.104
sudo modprobe nvidia
The GPU should come back right away; check with nvidia-smi.
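For multiple CUDA installations, choose the CUDA versions you intend to use, then install them in order from earliest to latest. For CUDA 9.0 the latest driver is 384.183, for 9.1 it is 390.116, and for CUDA 10.0 it is 410.104.
You can find the names on the following website, but do not use the .deb files.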
$ cd /usr/local
$ sudo rm cuda
$ sudo ln -s cuda-${cuda_version} cuda
$ wget http://us.download.nvidia.com/tesla/${nvidia_version}/NVIDIA-Linux-x86_64-${nvidia_version}.run
$ sudo sh ./NVIDIA-Linux-x86_64-${nvidia_version}.run --no-drm --disable-nouveau --dkms --silent --install-libglvnd
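Although reinstalling the driver gets it working again, that does not solve the problem and is not the correct answer to this question.
I observed the same problem on Ubuntu: reinstalling the driver is a workaround that lasts only until the day it breaks again. The cause of this spontaneous NVIDIA CUDA driver failure is Ubuntu's automatic security updates. When an update rebuilds the kernel, it breaks the CUDA driver, and nvidia-smi can no longer communicate with the driver.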
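One way to confirm this diagnosis is to check whether any NVIDIA kernel module exists for the kernel that is currently running. The sketch below assumes the modules live somewhere under /lib/modules/<kernel release>/ and that the file name starts with `nvidia`; the function name `has_nvidia_module` is hypothetical.

```shell
#!/bin/bash
# Succeed if the given kernel release has an NVIDIA module under the modules root.
# $1: kernel release, as printed by `uname -r`
# $2: modules root (optional, defaults to /lib/modules)
has_nvidia_module() {
    local release="$1" root="${2:-/lib/modules}"
    [ -d "$root/$release" ] || return 1
    # Matches nvidia.ko as well as versioned names such as nvidia_396.ko.
    [ -n "$(find "$root/$release" -name 'nvidia*.ko*' | head -n 1)" ]
}

# Usage: has_nvidia_module "$(uname -r)" || echo "no NVIDIA module for $(uname -r)"
```

If the check fails for the running kernel but succeeds for an older kernel directory, the driver was most likely built only for the previous kernel.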
A simple solution is to disable automatic security updates:
sudo apt -y remove unattended-upgrades
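A narrower alternative, offered only as a sketch and not something the answer above tested, is to keep automatic updates but hold the kernel meta-packages so that no new kernel is pulled in. This assumes the kernel comes from the `linux-image-<flavour>` and `linux-headers-<flavour>` meta-packages (for example `generic`, matching the `uname -a` output in the question); other instances may use a different flavour. The helper name `kernel_hold_command` is hypothetical, and it only prints the command so it can be reviewed first.

```shell
#!/bin/bash
# Build the apt-mark command that would hold the kernel meta-packages for a
# flavour such as "generic" (the suffix of `uname -r`).
# Nothing is changed on the system; the command is only printed.
kernel_hold_command() {
    local flavour="$1"
    echo "sudo apt-mark hold linux-image-$flavour linux-headers-$flavour"
}

# Usage: kernel_hold_command generic
```

The hold can be reversed later with `apt-mark unhold` on the same package names when a planned kernel upgrade and driver rebuild are due.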
This worked for me:
sudo apt purge nvidia-driver-450
sudo apt autoremove