How to run tensorflow GPU container on google compute engines?
I'm trying to run a tensorflow container on Google Compute Engine with a GPU accelerator attached.
I tried the command
gcloud compute instances create-with-container job-name \
--machine-type=n1-standard-4 \
--accelerator=type=nvidia-tesla-k80 \
--image-project=deeplearning-platform-release \
--image-family=common-container \
--container-image=gcr.io/my-container \
--container-arg="--container-arguments=xxxx"
but got this warning:
WARNING: This container deployment mechanism requires a Container-Optimized OS image in order to work. Select an image from a cos-cloud project (cos-stable, cos-beta, cos-dev image families).
I also tried a system image from the cos-cloud project, but it does not seem to have the CUDA driver installed, since tensorflow logs the warning cuInit failed.
What is the correct way to run a tensorflow container on Google Compute Engine with GPU support?
Have you considered Cloud TPU on GKE?
This page describes how to set up a GKE cluster with GPUs.
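If you go the GKE route, a rough sketch of the GPU setup could look like this. The cluster name gpu-cluster and the zone are placeholders, and the DaemonSet manifest URL is the NVIDIA driver installer referenced in the GKE GPU documentation for COS nodes:
# Create a small GKE cluster whose nodes each have one K80 attached
gcloud container clusters create gpu-cluster \
    --zone=us-central1-c \
    --machine-type=n1-standard-4 \
    --accelerator=type=nvidia-tesla-k80,count=1 \
    --num-nodes=1
# Fetch credentials so kubectl can talk to the new cluster
gcloud container clusters get-credentials gpu-cluster --zone=us-central1-c
# Install the NVIDIA driver on the GPU nodes via Google's driver-installer DaemonSet
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
After the driver pods finish, your GPU workloads can request nvidia.com/gpu resources as usual.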
You can docker run your container inside the startup-script of a deeplearningvm.
gcloud beta compute instances create deeplearningvm-$(date +"%Y%m%d-%H%M%S") \
--zone=us-central1-c \
--machine-type=n1-standard-8 \
--subnet=default \
--service-account=<your google service account> \
--scopes='https://www.googleapis.com/auth/cloud-platform' \
--accelerator=type=nvidia-tesla-k80,count=1 \
--image-project=deeplearning-platform-release \
--image-family=tf-latest-gpu \
--maintenance-policy=TERMINATE \
--metadata=install-nvidia-driver=True,startup-script='#!/bin/bash
# Check the driver until installed
while ! [[ -x "$(command -v nvidia-smi)" ]];
do
echo "sleep to check"
sleep 5s
done
echo "nvidia-smi is installed"
gcloud auth configure-docker
echo "Docker run with GPUs"
docker run --gpus all --log-driver=gcplogs --rm gcr.io/<your container>
echo "Kill VM $(hostname)"
gcloud compute instances delete $(hostname) --zone \
$(curl -H Metadata-Flavor:Google http://metadata.google.internal/computeMetadata/v1/instance/zone -s | cut -d/ -f4) -q
'
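Since the startup-script runs in the background after boot, one way to watch its progress is to follow the serial console output of the instance. A minimal sketch, where the instance name (deeplearningvm-YYYYMMDD-HHMMSS) and the zone are placeholders for the values from the create command above:
# Print the boot and startup-script logs of the instance
gcloud compute instances get-serial-port-output deeplearningvm-YYYYMMDD-HHMMSS \
    --zone=us-central1-c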
Since installing the nvidia driver takes a few minutes, you have to wait until the installation has finished before starting your container. https://cloud.google.com/ai-platform/deep-learning-vm/docs/tensorflow_start_instance#creating_a_tensorflow_instance_from_the_command_line
Compute Engine loads the latest stable driver on the first boot and performs the necessary steps (including a final reboot to activate the driver). It may take up to 5 minutes before your VM is fully provisioned. In this time, you will be unable to SSH into your machine. When the installation is complete, to guarantee that the driver installation was successful, you can SSH in and run nvidia-smi.
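For example, once SSH is available you could verify the driver and check that the container started, along these lines (instance name and zone are placeholders):
# Confirm the NVIDIA driver is working on the VM
gcloud compute ssh deeplearningvm-YYYYMMDD-HHMMSS --zone=us-central1-c --command "nvidia-smi"
# Check that the container launched by the startup-script is running
gcloud compute ssh deeplearningvm-YYYYMMDD-HHMMSS --zone=us-central1-c --command "docker ps"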