Kubernetes: runContainer: API error (500): Cannot start container (docker failed to umount)

Sometimes pod creation fails on our GKE cluster with a 500 error:

1m        1m        1         installer-u57ab1f7707b03   Pod                 Normal    Scheduled    {default-scheduler }                                       Successfully assigned installer-u57ab1f7707b03 to gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l
1m        1m        1         installer-u57ab1f7707b03   Pod                 Warning   FailedSync   {kubelet gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l}   Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container ff8573fbf0b90a25b5565b1feb36671f13367115dde74e581cf249be772d8e4e: [8] System error: read parent: connection reset by peer\n"
1m        1m        1         installer-u57ab1f7707b03   Pod                 Warning   FailedSync   {kubelet gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l}   Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container fbd7151d4489ed3ac9b21ef9ee3268039374fe3aee1f5933dc27d003f5388e7d: [8] System error: read parent: connection reset by peer\n"
1m        1m        1         installer-u57ab1f7707b03   Pod                 Warning   FailedSync   {kubelet gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l}   Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container c6b7969fd036fd187f8b5b815106887d718780b290b81e6dde12162d15c22728: [8] System error: read parent: connection reset by peer\n"
49s       49s       1         installer-u57ab1f7707b03   Pod                 Warning   FailedSync   {kubelet gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l}   Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container 5b0d78ee31759a3472f15fe375ef4f2542dcc65518023a1bd06593fe7d28a448: [8] System error: read parent: connection reset by peer\n"
32s       32s       1         installer-u57ab1f7707b03   Pod                 Warning   FailedSync   {kubelet gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l}   Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container 7ff5941a30ce432aa1b1382e4b20d272a08a7113f79f7f1ff2f8898a00ca8f06: [8] System error: read parent: connection reset by peer\n"
18s       18s       1         installer-u57ab1f7707b03   Pod                 Warning   FailedSync   {kubelet gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l}   Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container a91ae7d6dc9dee5196e73457d817bc46f8009c26147cc81727920aebfa52cc38: [8] System error: read parent: connection reset by peer\n"
2s        2s        1         installer-u57ab1f7707b03   Pod                 Warning   FailedSync   {kubelet gke-oro-cloud-v1-1445426963-ffbcc283-node-bo1l}   Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container ad8b7bbe72410232d7fe6197e057d15e9003e24f6d8aad15bc7068430cfea508: [8] System error: read parent: connection reset by peer\n"

In docker.log I found:

time="2016-08-10T12:37:24.458097892Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/ad8b7bbe72410232d7fe6197e057d15e9003e24f6d8aad15bc7068430cfea508/shm: invalid argument\nfailed to umount /var/lib/docker/containers/ad8b7bbe72410232d7fe6197e057d15e9003e24f6d8aad15bc7068430cfea508/mqueue: invalid argument"
time="2016-08-10T12:37:24.458280187Z" level=error msg="Handler for POST /containers/ad8b7bbe72410232d7fe6197e057d15e9003e24f6d8aad15bc7068430cfea508/start returned error: Cannot start container ad8b7bbe72410232d7fe6197e057d15e9003e24f6d8aad15bc7068430cfea508: [8] System error: read parent: connection reset by peer"
time="2016-08-10T12:37:24.458315257Z" level=error msg="HTTP Error" err="Cannot start container ad8b7bbe72410232d7fe6197e057d15e9003e24f6d8aad15bc7068430cfea508: [8] System error: read parent: connection reset by peer" statusCode=500
time="2016-08-10T12:37:40.151776337Z" level=warning msg="signal: killed" 

Kubernetes version: v1.2.5
Docker version: 1.9.1

Is there a fix or workaround for this?

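This is probably caused by a runc bug in Docker 1.9: the container reads its configuration but closes the read end of the pipe before the parent has finished writing.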
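To make that failure mode concrete, here is a minimal sketch of what happens when a reader goes away before the writer is done. It is not runc's code: it uses a plain `os.pipe()` on a POSIX system, and the function name `write_config_after_reader_closed` and the sample config are made up for illustration. The error in the logs above is "connection reset by peer" rather than a broken pipe, so the real channel likely behaves somewhat differently, but the ordering problem is the same.

```python
import os


def write_config_after_reader_closed(config: bytes) -> str:
    """Simulate a parent writing config to a child that closed its end early.

    Hypothetical helper for illustration only; assumes a POSIX system.
    """
    read_fd, write_fd = os.pipe()
    # The "child" side gives up early: it closes the read end before the
    # "parent" has written its data.
    os.close(read_fd)
    try:
        # The "parent" now tries to send the config to a reader that is gone.
        os.write(write_fd, config)
        return "written"
    except BrokenPipeError:
        # EPIPE: no process holds the read end any more.
        return "broken pipe"
    finally:
        os.close(write_fd)


if __name__ == "__main__":
    print(write_config_after_reader_closed(b'{"args": ["sleep", "3600"]}'))
```

The write fails even though the parent did nothing wrong, which matches the symptom here: the start request fails on the Docker side and Kubernetes only sees a generic 500.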

Docker 1.10 includes a fixed runc. Kubernetes 1.3 uses Docker 1.11.2, but until you upgrade you may be able to work around the problem by adding extra characters to your container's command line.