Not able to run pods with GPU in GKE: 2 insufficient nvidia.com/gpu error
We followed this guide to use GPU-enabled nodes in our existing cluster, but when we try to schedule pods we get a 2 Insufficient nvidia.com/gpu error.
Details:
We are trying to use GPUs in our existing cluster. For that we were able to successfully create a node pool containing a GPU-enabled node.
Then, as the next step per the guide above, we had to create a DaemonSet, and we were able to get the DS running successfully as well.
But now, when we try to schedule a Pod with the following resources section, the Pod becomes unschedulable with the error 2 Insufficient nvidia.com/gpu:
resources:
limits:
nvidia.com/gpu: "1"
requests:
cpu: 200m
memory: 3Gi
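For completeness, the snippet above sits inside the container spec of the workload roughly like this (the pod name and image below are placeholders, not the actual manifest):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                  # placeholder name
spec:
  containers:
  - name: gpu-test
    image: nvidia/cuda:11.0.3-base-ubuntu20.04    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: 200m
        memory: 3Gi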
Specs:
Node version - v1.18.17-gke.700 (also tried v1.17.17-gke.6000)
Instance type - n1-standard-4
image - cos
GPU - NVIDIA Tesla T4
Any help or pointers on how to debug this further would be much appreciated.
TIA,
Output of kubectl get node <gpu-node> -o yaml [redacted]:
apiVersion: v1
kind: Node
metadata:
labels:
beta.kubernetes.io/arch: amd64
beta.kubernetes.io/instance-type: n1-standard-4
beta.kubernetes.io/os: linux
cloud.google.com/gke-accelerator: nvidia-tesla-t4
cloud.google.com/gke-boot-disk: pd-standard
cloud.google.com/gke-container-runtime: docker
cloud.google.com/gke-nodepool: gpu-node
cloud.google.com/gke-os-distribution: cos
cloud.google.com/machine-family: n1
failure-domain.beta.kubernetes.io/region: us-central1
failure-domain.beta.kubernetes.io/zone: us-central1-b
kubernetes.io/arch: amd64
kubernetes.io/os: linux
node.kubernetes.io/instance-type: n1-standard-4
topology.kubernetes.io/region: us-central1
topology.kubernetes.io/zone: us-central1-b
name: gke-gpu-node-d6ddf1f6-0d7j
spec:
taints:
- effect: NoSchedule
key: nvidia.com/gpu
value: present
status:
...
allocatable:
attachable-volumes-gce-pd: "127"
cpu: 3920m
ephemeral-storage: "133948343114"
hugepages-2Mi: "0"
memory: 12670032Ki
pods: "110"
capacity:
attachable-volumes-gce-pd: "127"
cpu: "4"
ephemeral-storage: 253696108Ki
hugepages-2Mi: "0"
memory: 15369296Ki
pods: "110"
conditions:
...
nodeInfo:
architecture: amd64
containerRuntimeVersion: docker://19.3.14
kernelVersion: 5.4.89+
kubeProxyVersion: v1.18.17-gke.700
kubeletVersion: v1.18.17-gke.700
operatingSystem: linux
osImage: Container-Optimized OS from Google
Tolerations of the deployment:
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
The nvidia-gpu-device-plugin should also be installed on the GPU node. You should see a nvidia-gpu-device-plugin DaemonSet in the kube-system namespace.
It should be deployed automatically by Google, but if you want to deploy it yourself, run the following command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
It will install the GPU device plugin on the node, and after that your pods will be able to use the GPU.
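A quick way to verify that the plugin is actually serving GPUs (a minimal check, not part of the original answer) is to confirm its pods are Running and that the node advertises nvidia.com/gpu as an allocatable resource:
kubectl get daemonset -n kube-system | grep nvidia
kubectl get pods -n kube-system | grep nvidia
kubectl describe node <gpu-node> | grep -A 8 Allocatable
# expect a line like:  nvidia.com/gpu: 1
# if nvidia.com/gpu is missing or 0, the scheduler will keep reporting Insufficient nvidia.com/gpu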
To complete @hilesenrat's answer, which didn't exactly fit my case but led me to the solution:
In my case the plugin DaemonSet was actually already installed, but its pods would not start because of a volume error.
kubectl get pods -n kube-system | grep -i nvidia
nvidia-gpu-device-plugin-cbk9m 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-gt5vf 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-mgrr5 0/1 ContainerCreating 0 22h
nvidia-gpu-device-plugin-vt474 0/1 ContainerCreating 0 22h
kubectl describe pods nvidia-gpu-device-plugin-cbk9m -n kube-system
...
Warning FailedMount 5m1s (x677 over 22h) kubelet MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory
In fact, according to the google cloud documentation, the NVIDIA device drivers need to be installed on those nodes. Once they were installed, the situation was unblocked:
kubectl get pods -n kube-system | grep -i nvidia
nvidia-driver-installer-6mxr2 1/1 Running 0 2m33s
nvidia-driver-installer-8lww7 1/1 Running 0 2m33s
nvidia-driver-installer-m748p 1/1 Running 0 2m33s
nvidia-driver-installer-r4x8c 1/1 Running 0 2m33s
nvidia-gpu-device-plugin-cbk9m 1/1 Running 0 22h
nvidia-gpu-device-plugin-gt5vf 1/1 Running 0 22h
nvidia-gpu-device-plugin-mgrr5 1/1 Running 0 22h
nvidia-gpu-device-plugin-vt474 1/1 Running 0 22h
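For reference, the driver installer shown above is itself a DaemonSet; on Container-Optimized OS nodes it is typically deployed with a command along these lines (check the current Google Cloud documentation for the exact URL matching your node image):
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml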
Using the nvidia-gpu-device-plugin is the first thing you should try, but there are some other requirements that also need to be met:
- Make sure the device plugins feature gate is enabled in the kubelet configuration at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf:
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
- The Docker configuration (typically /etc/docker/daemon.json) should look like this (restart commands to apply these changes follow after the list):
{
"exec-opts": ["native.cgroupdriver=systemd"],
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
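After changing either of the configurations above, the services have to be reloaded and restarted for the changes to take effect; a minimal sketch, assuming a systemd-managed node with sudo access:
sudo systemctl daemon-reload      # pick up the edited kubelet drop-in
sudo systemctl restart docker     # apply the new default runtime
sudo systemctl restart kubelet    # restart kubelet with the feature gate enabled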
Last but not least, if you are still facing the same problem, make sure you are using an official base image released by NVIDIA.
When I tried a custom install of PyTorch with CUDA on Ubuntu in a docker build, the image built successfully, but the application could not detect CUDA, so I suggest you use the official images built by NVIDIA.
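As a minimal sketch of what that looks like (the base-image tag, CUDA version, and entrypoint below are illustrative choices, not from the original answer):
FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04   # official NVIDIA base image; pick a tag matching your target CUDA version

# install Python and a CUDA-matching PyTorch wheel instead of compiling CUDA support yourself
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
 && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir torch --extra-index-url https://download.pytorch.org/whl/cu113

COPY . /app
WORKDIR /app
CMD ["python3", "main.py"]   # placeholder entrypoint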