Not able to run pods with GPU in GKE: 2 insufficient nvidia.com/gpu error

We followed this guide to use GPU-enabled nodes in our existing cluster, but when we try to schedule pods we get a 2 Insufficient nvidia.com/gpu error.

Details:

We are trying to use GPUs in our existing cluster, and for that we were able to successfully create a node pool with a single GPU-enabled node.

As the next step, per the guide above, we had to create a DaemonSet, and we were able to run the DS successfully as well.

But now, when we try to schedule a pod with the following resources section, it becomes unschedulable with the error 2 Insufficient nvidia.com/gpu:

    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        cpu: 200m
        memory: 3Gi

Specs:

Node version - v1.18.17-gke.700 (+ v1.17.17-gke.6000) tried on both
Instance type - n1-standard-4
image - cos
GPU - NVIDIA Tesla T4

Any help or pointers on debugging this further would be greatly appreciated.

TIA,


Output of kubectl get node <gpu-node> -o yaml [edited]

apiVersion: v1
kind: Node
metadata:
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: n1-standard-4
    beta.kubernetes.io/os: linux
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-boot-disk: pd-standard
    cloud.google.com/gke-container-runtime: docker
    cloud.google.com/gke-nodepool: gpu-node
    cloud.google.com/gke-os-distribution: cos
    cloud.google.com/machine-family: n1
    failure-domain.beta.kubernetes.io/region: us-central1
    failure-domain.beta.kubernetes.io/zone: us-central1-b
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: n1-standard-4
    topology.kubernetes.io/region: us-central1
    topology.kubernetes.io/zone: us-central1-b
  name: gke-gpu-node-d6ddf1f6-0d7j
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: present
status:
  ...
  allocatable:
    attachable-volumes-gce-pd: "127"
    cpu: 3920m
    ephemeral-storage: "133948343114"
    hugepages-2Mi: "0"
    memory: 12670032Ki
    pods: "110"
  capacity:
    attachable-volumes-gce-pd: "127"
    cpu: "4"
    ephemeral-storage: 253696108Ki
    hugepages-2Mi: "0"
    memory: 15369296Ki
    pods: "110"
  conditions:
    ...
  nodeInfo:
    architecture: amd64
    containerRuntimeVersion: docker://19.3.14
    kernelVersion: 5.4.89+
    kubeProxyVersion: v1.18.17-gke.700
    kubeletVersion: v1.18.17-gke.700
    operatingSystem: linux
    osImage: Container-Optimized OS from Google

Tolerations of the deployment

  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

The nvidia-gpu-device-plugin should also be installed on the GPU node. You should see a nvidia-gpu-device-plugin DaemonSet in the kube-system namespace.

It should be deployed by Google automatically, but if you want to deploy it yourself, run the command below:

    kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

It will install the GPU plugin on the node, and after that your pods will be able to use the GPUs.
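
Once those plugin pods are running (and the drivers are installed), the node should start advertising the GPU resource. In the node YAML posted in the question, nvidia.com/gpu is missing from both capacity and allocatable, which is exactly what produces the Insufficient nvidia.com/gpu scheduling error. One quick way to check is:

    # nvidia.com/gpu should appear under Capacity and Allocatable once the
    # device plugin (and the drivers) are working on this node
    kubectl describe node <gpu-node> | grep -A 10 "Allocatable:"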

To complete @hilesenrat's answer: it did not exactly fit my case, but it put me on the track of the solution.

In my case, the plugin DaemonSet was in fact already installed, but its pods would not start because of a volume error.

kubectl get pods -n kube-system | grep -i nvidia
nvidia-gpu-device-plugin-cbk9m                      0/1    ContainerCreating   0          22h
nvidia-gpu-device-plugin-gt5vf                      0/1    ContainerCreating   0          22h
nvidia-gpu-device-plugin-mgrr5                      0/1    ContainerCreating   0          22h
nvidia-gpu-device-plugin-vt474                      0/1    ContainerCreating   0          22h

 kubectl describe pods nvidia-gpu-device-plugin-cbk9m -n kube-system
 ...
 Warning  FailedMount  5m1s (x677 over 22h)   kubelet  MountVolume.SetUp failed for volume "nvidia" : hostPath type check failed: /home/kubernetes/bin/nvidia is not a directory

Indeed, according to the Google Cloud documentation, the NVIDIA device drivers need to be installed on those nodes. Once they were installed, everything was unblocked.
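
For COS nodes like these, the GKE documentation has you install the drivers by deploying NVIDIA's driver installer DaemonSet. The manifest URL below is the one the docs pointed to at the time; check the current GKE documentation in case it has moved:

    # Deploys the NVIDIA driver installer DaemonSet for COS-based GPU nodes
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml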

kubectl get pods -n kube-system | grep -i nvidia
nvidia-driver-installer-6mxr2                       1/1     Running   0          2m33s
nvidia-driver-installer-8lww7                       1/1     Running   0          2m33s
nvidia-driver-installer-m748p                       1/1     Running   0          2m33s
nvidia-driver-installer-r4x8c                       1/1     Running   0          2m33s
nvidia-gpu-device-plugin-cbk9m                      1/1     Running   0          22h
nvidia-gpu-device-plugin-gt5vf                      1/1     Running   0          22h
nvidia-gpu-device-plugin-mgrr5                      1/1     Running   0          22h
nvidia-gpu-device-plugin-vt474                      1/1     Running   0          22h

Using the nvidia-gpu-device-plugin is the first thing you should try, but there are some other requirements that need to be satisfied as well:

  1. Make sure the device plugins feature gate is added to the kubelet configuration in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf:
     Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
  2. The Docker daemon configuration should look like this (restart commands for both services follow the JSON below):
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
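
After editing either of those files (the kubelet drop-in above and, typically, /etc/docker/daemon.json for the Docker settings), reload and restart the corresponding services so the changes take effect, for example:

    # pick up the kubelet drop-in changes
    sudo systemctl daemon-reload
    sudo systemctl restart kubelet
    # pick up the Docker daemon configuration changes
    sudo systemctl restart docker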

Last but not least, if you are still facing the same issue, make sure you are using official base images released by NVIDIA.

When I tried a custom install of PyTorch with CUDA on Ubuntu in a docker build, the image built successfully, but the application could not detect CUDA, so I suggest using the official images built by NVIDIA.
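
As a rough sketch of that last point, a Dockerfile along these lines starts from an official NVIDIA CUDA runtime image instead of plain Ubuntu (the image tag and PyTorch install step are illustrative; match them to the CUDA version your node drivers support):

    # Dockerfile sketch: official NVIDIA CUDA runtime base image
    FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu20.04

    RUN apt-get update && \
        apt-get install -y --no-install-recommends python3 python3-pip && \
        rm -rf /var/lib/apt/lists/*

    # Install a CUDA-enabled PyTorch wheel rather than building it yourself;
    # use the selector on pytorch.org to pick the build matching your CUDA version.
    RUN pip3 install torch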