Why can’t I run IPU programs as non-root in Docker containers?

I'm trying to run the CNN training from Graphcore's examples repo as a non-root user in Graphcore's TensorFlow 1.5 Docker image, but it throws:

2020-04-23 11:17:32.960014: I tensorflow/compiler/jit/xla_compilation_cache.cc:250] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
Saved checkpoint to ./logs/RN152_bs1x16p_GN32_16.16_v1.1.11_6LT/ckpt-0
2020-04-23 11:19:07.615030: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at xla_ops.cc:361 : Unknown: [Error][Build graph] could not get temporary file for model 'MappedCodelet_%%%%%%%%%%%%%%.cpp': Permission denied
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list,run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: [Error][Build graph] could not get temporary file for model 'MappedCodelet_%%%%%%%%%%%%%%.cpp': Permission denied
  [[{{node cluster}}]]

When I run as root the program works fine, but once I create a new user it starts throwing this error. Does this mean Graphcore's Docker images only work when you are root?

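It is possible to run IPU programs as a non-root user. You are seeing this behaviour because switching users inside a running Docker container (as in any Ubuntu-based environment) resets the environment variables. Those variables hold important IPU configuration settings that are needed to attach to an IPU and run programs on it. You can avoid this by doing the user management in the Dockerfile. Below is an example snippet (where examples is a clone of https://github.com/graphcore/examples/):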

FROM graphcore/tensorflow:1 
ENV LC_ALL=C.UTF-8 
ENV LANG=C.UTF-8 
RUN adduser [username]   
ADD examples examples 
RUN chown [username] -R examples 

Then you can build the image:

docker image build . -t graphcore-examples 

You now have three options for running the CNN training as a non-root user:

  1. Run the CNN training directly:
gc-docker -- -ti -u [username] graphcore-examples python3 /examples/applications/tensorflow/cnns/training/train.py
  2. Start the container in a bash shell as the non-root user, then run the training from there:
gc-docker -- -ti -u [username] graphcore-examples
$ python3 /examples/applications/tensorflow/cnns/training/train.py
  3. Start the container as root, then preserve the environment when switching users:
gc-docker -- -ti graphcore-examples
$ su --preserve-environment - [username]
$ python3 /examples/applications/tensorflow/cnns/training/train.py

I recommend using option 1 or 2 where possible. You can find more information about the gc-docker command-line tool here.
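If you want to confirm that lost environment variables are the cause in your own container, one way is to compare the environment before and after switching users. The following is only a minimal sketch, not part of Graphcore's tooling: the helper names `parse_env` and `missing_vars` are made up for illustration, and the variable `IPU_EXAMPLE_SETTING` is a hypothetical placeholder, not a real IPU setting.

```python
def parse_env(text):
    """Parse the output of the `env` command into a dict.

    Assumes one NAME=value pair per line; values that span
    several lines are not handled by this simple sketch.
    """
    env = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            env[name] = value
    return env


def missing_vars(before, after):
    """Return the sorted names present in `before` but absent from `after`."""
    return sorted(name for name in before if name not in after)


# Illustrative data only: IPU_EXAMPLE_SETTING is a hypothetical name.
root_env = parse_env("HOME=/root\nIPU_EXAMPLE_SETTING=/some/path\nLANG=C.UTF-8")
user_env = parse_env("HOME=/home/user\nLANG=C.UTF-8")

for name in missing_vars(root_env, user_env):
    print("lost after switching user:", name)
```

In practice you could save the output of `env` to a file as root, save it again after switching users, and pass the contents of the two files to `parse_env` instead of the inline strings above. Any variables reported as lost are candidates for the settings the IPU runtime needs.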