GCP - Cannot SSH into fresh GPU Deep Learning VM instance

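If I create a new GCE VM instance with a GPU and the GPU-optimized Debian image, I cannot SSH into it, either through the browser SSH window or with a third-party SSH client (after uploading the public key).

I have already tried the suggestions here, but they did not help.

If I create an instance without a GPU using a standard Ubuntu image, everything works out of the box.

Am I missing something about GPU Deep Learning instances?

Edit:

GCloud command to reproduce: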

gcloud beta compute --project=avid-compound-233309 instances create instance-1 --zone=us-central1-a --machine-type=n1-standard-1 --subnet=default --network-tier=PREMIUM --maintenance-policy=TERMINATE --service-account=105060870131-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append --accelerator=type=nvidia-tesla-k80,count=1 --image=c0-common-gce-gpu-image-20191213 --image-project=ml-images --boot-disk-size=50GB --boot-disk-type=pd-standard --boot-disk-device-name=instance-1 --reservation-affinity=any

Yes, it happens right after the VM is created. The serial port 1 log contains a large number of errors; a short sample:

[    9.393769] google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
[    9.394022] google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
[    9.394250] google_accounts_daemon[692]: Remainder of file ignored
[    9.394504] google_accounts_daemon[692]: Traceback (most recent call last):
[    9.394767] google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
[    9.395108] google_accounts_daemon[692]:     from pkg_resources import load_entry_point
[    9.395344] google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
[    9.395502] google_accounts_daemon[692]:     from pkg_resources.extern import six
[    9.395719] google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "/usr/lib/python3.5/site.py", line 173, in addpackage
Dec 23 19:40:05 localhost google_accounts_daemon[692]:       exec(line)
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<string>", line 1, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     File "<frozen importlib._bootstrap>", line 574, in module_from_spec
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   AttributeError: 'NoneType' object has no attribute 'loader'
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Remainder of file ignored
Dec 23 19:40:05 localhost google_accounts_daemon[692]: Traceback (most recent call last):
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/bin/google_accounts_daemon", line 6, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources import load_entry_point
Dec 23 19:40:05 localhost google_accounts_daemon[692]:   File "/usr/local/lib/python3.5/dist-packages/pkg_resources/__init__.py", line 57, in <module>
Dec 23 19:40:05 localhost google_accounts_daemon[692]:     from pkg_resources.extern import six
Dec 23 19:40:05 localhost google_accounts_daemon[692]: ImportError: No module named 'pkg_resources.extern'

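The newly released image "GPU Optimized Debian m32 (with CUDA 10.0) (c0-common-gce-gpu-image-20191213)" appears to contain a corrupted ext4 file system. Directories, configuration files, and scripts contain garbage, so the initial configuration on first boot fails.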

Started Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ 4.880071] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 144, inode_bitmap = 4718608
[ 4.883559] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 145, inode_bitmap = 4718609
[ 4.887054] EXT4-fs error (device sda1): ext4_validate_inode_bitmap:98: comm systemd-tmpfile: Corrupt inode bitmap - block_group = 146, inode_bitmap = 4718610
...
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ecdsa_key.pub is not a public key file.
localhost dhclient[516]: 
localhost ssh-generate-hostkeys[485]: /etc/ssh/ssh_host_ed25519_key.pub is not a public key file.
localhost ssh-generate-hostk[  OK  ] Started Getty on tty1.
...
keys[485]: /etc/ssh/ssh_host_rsa_key.pub is not a public key file.
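
To check whether another instance shows the same symptoms, the serial console output can be scanned for the three signatures quoted above. This is only a rough sketch: the function name summarize_serial_log is made up for illustration, and the patterns are taken directly from the log excerpts in this post, so they may not cover other failure modes.

```shell
# Hypothetical helper: reads a serial console log on stdin and counts
# lines that match the failure signatures quoted in this post.
summarize_serial_log() {
  awk '
    /EXT4-fs error/            { ext4++ }     # corrupted file system metadata
    /is not a public key file/ { hostkey++ }  # unreadable SSH host keys
    /ImportError|AttributeError/ { py++ }     # broken Python guest daemons
    END {
      printf "ext4_errors=%d bad_host_keys=%d python_errors=%d\n", ext4, hostkey, py
    }
  '
}

# Example usage (assumes the instance name and zone from the create command
# in the question, and an authenticated gcloud):
#   gcloud compute instances get-serial-port-output instance-1 \
#     --zone=us-central1-a --port=1 | summarize_serial_log
```

Non-zero counts for all three suggest the image itself is damaged rather than the SSH keys or firewall rules being misconfigured.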

A public issue for this was recently opened on the Public Issue Tracker: https://issuetracker.google.com/146807209

It should be fixed soon.