How does mount propagation behave when calling clone with `CLONE_NEWUSER|CLONE_NEWNS`?

My program calls clone and execs /bin/sh in the child process.

In that shell, I run cat /proc/$$/mountinfo to check the propagation type. If the only flag is CLONE_NEWNS, I get this:

# cat /proc/$$/mountinfo
194 193 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw,discard,errors=remount-ro
...

If I combine CLONE_NEWNS|CLONE_NEWUSER (uncomment flags |= CLONE_NEWUSER; in the source below), I get this:

199 198 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,discard,errors=remount-ro
...

Why does CLONE_NEWUSER make a difference? On my machine (Debian 9), I expected the mount to always be MS_SHARED, because it was copied from an MS_SHARED mount point.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static char container_stack[STACK_SIZE];

char *const container_args[] = {"/bin/sh", NULL};

int container_main(void *arg) {
  (void)arg;  /* unused */
  printf("Container - inside the container!\n");
  printf("container pid is %d\n", getpid());
  /* execv only returns if it failed to replace the process image. */
  execv(container_args[0], container_args);
  perror("execv");
  printf("Something's wrong!\n");
  return 1;
}

int main() {
  printf("Parent [ %d ] - start a container!\n", getpid());

  int flags = CLONE_NEWNS;
  //flags |= CLONE_NEWUSER;

  int container_pid = clone(container_main, container_stack + STACK_SIZE,
                            SIGCHLD | flags, NULL);
  if (container_pid < 0) {
    perror("clone");
    return -1;
  }

  printf("Container pid is %d\n", container_pid);
  waitpid(container_pid, NULL, 0);
  printf("Parent - container stopped!\n");
  return 0;
}

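man 7 mount_namespaces explains this. With CLONE_NEWUSER, the new mount namespace is owned by the newly created user namespace rather than by the parent's, so it counts as less privileged and its shared mounts are turned into slave mounts. Relevant excerpts: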

   *  Each mount namespace has an owner user namespace.  As
      explained above, when a new mount namespace is created, its
      mount point list is initialized as a copy of the mount point
      list of another mount namespace.  If the new namespace and the
      namespace from which the mount point list was copied are owned
      by different user namespaces, then the new mount namespace is
      considered less privileged.

   *  When creating a less privileged mount namespace, shared mounts
      are reduced to slave mounts.  (Shared and slave mounts are
      discussed below.)  This ensures that mappings performed in
      less privileged mount namespaces will not propagate to more
      privileged mount namespaces.

   shared:X
          This mount point is shared in peer group X.  Each peer
          group has a unique ID that is automatically generated by
          the kernel, and all mount points in the same peer group
          will show the same ID.  (These IDs are assigned starting
          from the value 1, and may be recycled when a peer group
          ceases to have any members.)

   master:X
          This mount is a slave to shared peer group X.
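
To compare the two cases without reading the output by eye, the optional fields can be pulled out of a mountinfo line programmatically. The following is only a minimal sketch: the function name mountinfo_propagation is made up for illustration, and it assumes the usual /proc/[pid]/mountinfo layout, in which six fixed fields are followed by zero or more optional fields and then a lone "-" separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of any library): copy the optional
 * fields of one /proc/[pid]/mountinfo line into `out`, separated by
 * single spaces. A private mount has no optional fields, so `out`
 * ends up as an empty string. Returns 0 on success, -1 if the line
 * is malformed or `out` is too small.
 */
int mountinfo_propagation(const char *line, char *out, size_t outsz) {
  assert(line != NULL && out != NULL && outsz > 0);
  out[0] = '\0';

  /* Skip the six fixed fields: mount ID, parent ID, major:minor,
   * root, mount point, mount options. Spaces inside paths are
   * escaped as \040 by the kernel, so splitting on ' ' is safe. */
  const char *p = line;
  for (int i = 0; i < 6; i++) {
    p = strchr(p, ' ');
    if (p == NULL) return -1;
    p++;
  }

  /* Collect tokens until the lone "-" that ends the optional fields. */
  size_t used = 0;
  for (;;) {
    size_t len = strcspn(p, " \n");
    if (len == 0) return -1;                /* line ended too early */
    if (len == 1 && p[0] == '-') return 0;  /* separator reached */
    /* Conservative size check: joining space + token + NUL. */
    if (used + len + 2 > outsz) return -1;
    if (used > 0) out[used++] = ' ';
    memcpy(out + used, p, len);
    used += len;
    out[used] = '\0';
    p += len;
    if (*p != ' ') return -1;               /* no separator followed */
    p++;
  }
}
```

Fed the two root-mount lines from the question, this yields shared:1 for the CLONE_NEWNS case and master:1 for the CLONE_NEWNS|CLONE_NEWUSER case, which matches the reduction from shared to slave described in the man page excerpt.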