k8s pods stuck in failed/shutdown state after preemption (gke v1.20)
TL;DR - GKE 1.20 preemptible nodes leave pods stuck as zombies in Failed/Shutdown
For a few years we have been running GKE clusters that mix stable and preemptible node pools. Recently, starting with GKE v1.20, we began to see preempted pods enter a strange zombie state in which they are described as:
Status: Failed
Reason: Shutdown
Message: Node is shutting, evicting pods
When this started happening we were convinced it was related to our pods not handling SIGTERM correctly at preemption. To rule out our service software as the source of the problem, we boiled it down to a simple service that mostly sleeps:
/* eslint-disable no-console */
let exitNow = false
process.on( 'SIGINT', () => {
console.log( 'INT shutting down gracefully' )
exitNow = true
} )
process.on( 'SIGTERM', () => {
console.log( 'TERM shutting down gracefully' )
exitNow = true
} )
const sleep = ( seconds ) => {
return new Promise( ( resolve ) => {
setTimeout( resolve, seconds * 1000 )
} )
}
const Main = async ( cycles = 120, delaySec = 5 ) => {
console.log( `Starting ${cycles}, ${delaySec} second cycles` )
for ( let i = 1; i <= cycles && !exitNow; i++ ) {
console.log( `---> ${i} of ${cycles}` )
await sleep( delaySec ) // eslint-disable-line
}
console.log( '*** Cycle Complete - exiting' )
process.exit( 0 )
}
Main()
This code is built into a docker image, using the tini init to spawn the pod process running under nodejs (fermium-alpine image). No matter how we shuffled the signal handling around, the pods never seemed to actually shut down cleanly, even though the logs suggested they did.
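For reference, a minimal sketch of the image described above, assuming the sleep-loop script is saved as app.js (the actual Dockerfile is not part of the original post):

# Sketch only: node fermium-alpine base with tini as PID 1
FROM node:fermium-alpine

# tini forwards SIGTERM/SIGINT to the node process and reaps zombies
RUN apk add --no-cache tini

WORKDIR /app
COPY app.js .

ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "app.js"]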
Another oddity is that, according to the Kubernetes pod logs, we see the pod termination start and then get cancelled:
2021-08-06 17:00:08.000 EDT Stopping container preempt-pod
2021-08-06 17:02:41.000 EDT Cancelling deletion of Pod preempt-pod
We also tried adding a 15 second preStop delay just to see if it had any effect, but nothing we did seemed to matter - the pods became zombies. New replicas were started on the other nodes available in the pool, so the minimum number of successfully running pods was always maintained.
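The preStop delay we tried looked roughly like this (a sketch of a pod spec fragment; the image name and timings are placeholders, not our exact manifest):

# Illustrative fragment of the workload's pod template
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: preempt-pod
    image: example-registry/preempt-pod:latest   # hypothetical image reference
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]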
We are also testing the preemption cycle using a simulated maintenance event:
gcloud compute instances simulate-maintenance-event NODE_ID
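In practice the instance name and zone have to be supplied explicitly; a sketch of how that can be driven from the pod side (the label selector and zone are placeholders):

# Find the GCE instance backing the pod's node, then trigger a simulated preemption
NODE=$(kubectl get pod -n default -l app=preempt-pod -o jsonpath='{.items[0].spec.nodeName}')
gcloud compute instances simulate-maintenance-event "$NODE" --zone us-central1-a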
Starting with GKE 1.20.5 and later, the kubelet graceful node shutdown feature is enabled on preemptible nodes. From the notes on the feature page:
When pods were evicted during the graceful node shutdown, they are marked as failed. Running kubectl get pods shows the status of the evicted pods as Shutdown. And kubectl describe pod indicates that the pod was evicted because of node shutdown:

Status: Failed Reason: Shutdown Message: Node is shutting, evicting pods

Failed pod objects will be preserved until explicitly deleted or cleaned up by the GC. This is a change of behavior compared to abrupt node termination.
These pods should eventually be garbage collected, although I'm not sure what the threshold is.
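As far as I can tell, the relevant knob is the kube-controller-manager's terminated-pod GC threshold, which only kicks in once a large number of terminated pods have accumulated; on GKE the control plane is managed, so it isn't something you can tune yourself:

# kube-controller-manager flag (default 12500): terminated pods are only
# garbage collected once this many exist cluster-wide
--terminated-pod-gc-threshold=12500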
After researching various posts, I finally settled on running a cronjob every 9 minutes to avoid the alertManager triggers that fire once pods have been stalled for 10+ minutes. This still feels like a hack to me, but it works, and it forced me to dig into k8s CronJobs and RBAC.
This post put me on the path:
And the resulting cronjob spec:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-accessor-role
  namespace: default
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "delete", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-access
  namespace: default
subjects:
- kind: ServiceAccount
  name: cronjob-sa
  namespace: default
roleRef:
  kind: Role
  name: pod-accessor-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cronjob-sa
  namespace: default
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-zombie-killer
  namespace: default
spec:
  schedule: "*/9 * * * *"
  successfulJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        metadata:
          name: cron-zombie-killer
          namespace: default
        spec:
          serviceAccountName: cronjob-sa
          restartPolicy: Never
          containers:
          - name: cron-zombie-killer
            imagePullPolicy: IfNotPresent
            image: bitnami/kubectl
            command:
            - "/bin/sh"
            args:
            - "-c"
            - "kubectl get pods -n default --field-selector='status.phase==Failed' -o name | xargs kubectl delete -n default 2> /dev/null"
status: {}
Note that the stderr redirect to /dev/null is simply to suppress the error output from kubectl delete when kubectl get finds no pods in the Failed state.
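The same selector can also be exercised by hand to sanity-check things, and a field selector on the delete itself avoids the empty-input error without the redirect (a sketch; I haven't verified it against every kubectl version):

# List any zombie pods left behind by a preemption
kubectl get pods -n default --field-selector=status.phase=Failed

# Delete them directly with a field selector instead of piping through xargs
kubectl delete pods -n default --field-selector=status.phase=Failed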
Update: Added the missing "delete" verb to the role, and added the missing RoleBinding.
Update: Added imagePullPolicy.