k8s pods stuck in failed/shutdown state after preemption (gke v1.20)
TL;DR - GKE 1.20 preemptible nodes leave pods stuck as zombies in Failed/Shutdown
For a few years we have been running GKE clusters that mix stable and preemptible node pools. Recently, starting with GKE v1.20, we began to see preempted pods enter a strange zombie state in which they are described as:
Status: Failed
Reason: Shutdown
Message: Node is shutting, evicting pods
When this started happening we were convinced it was related to our pods not handling SIGTERM correctly at preemption. To rule out our service software as the source of the problem, we boiled it down to a simple service that mostly sleeps:
/* eslint-disable no-console */
let exitNow = false
process.on( 'SIGINT', () => {
console.log( 'INT shutting down gracefully' )
exitNow = true
} )
process.on( 'SIGTERM', () => {
console.log( 'TERM shutting down gracefully' )
exitNow = true
} )
const sleep = ( seconds ) => {
return new Promise( ( resolve ) => {
setTimeout( resolve, seconds * 1000 )
} )
}
const Main = async ( cycles = 120, delaySec = 5 ) => {
console.log( `Starting ${cycles}, ${delaySec} second cycles` )
for ( let i = 1; i <= cycles && !exitNow; i++ ) {
console.log( `---> ${i} of ${cycles}` )
await sleep( delaySec ) // eslint-disable-line
}
console.log( '*** Cycle Complete - exiting' )
process.exit( 0 )
}
Main()
This code is built into a docker image, using the tini init to spawn the pod process running under nodejs (fermium-alpine image). No matter how we shuffled the signal handling around, the pods never seemed to actually shut down cleanly, even though the logs suggested they did.
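For reference, a minimal sketch of the image described above, assuming the sleep-loop script is saved as app.js (the actual Dockerfile is not part of the original post):

# Sketch only: node fermium-alpine base with tini as PID 1
FROM node:fermium-alpine

# tini forwards SIGTERM/SIGINT to the node process and reaps zombies
RUN apk add --no-cache tini

WORKDIR /app
COPY app.js .

ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "app.js"]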
Another oddity is that, according to the Kubernetes pod logs, we see the pod termination start and then get cancelled:
2021-08-06 17:00:08.000 EDT Stopping container preempt-pod
2021-08-06 17:02:41.000 EDT Cancelling deletion of Pod preempt-pod
We also tried adding a 15 second preStop delay just to see if it had any effect, but nothing we did seemed to matter - the pods became zombies. New replicas were started on the other nodes available in the pool, so the minimum number of successfully running pods was always maintained.
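The preStop delay we tried looked roughly like this (a sketch of a pod spec fragment; the image name and timings are placeholders, not our exact manifest):

# Illustrative fragment of the workload's pod template
spec:
  terminationGracePeriodSeconds: 30
  containers:
  - name: preempt-pod
    image: example-registry/preempt-pod:latest   # hypothetical image reference
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15"]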
We are also testing the preemption cycle using a simulated maintenance event:
gcloud compute instances simulate-maintenance-event NODE_ID
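In practice the instance name and zone have to be supplied explicitly; a sketch of how that can be driven from the pod side (the label selector and zone are placeholders):

# Find the GCE instance backing the pod's node, then trigger a simulated preemption
NODE=$(kubectl get pod -n default -l app=preempt-pod -o jsonpath='{.items[0].spec.nodeName}')
gcloud compute instances simulate-maintenance-event "$NODE" --zone us-central1-a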
Starting with GKE 1.20.5 and later, the kubelet graceful node shutdown feature is enabled on preemptible nodes. From the notes on the feature page:
When pods were evicted during the graceful node shutdown, they are marked as failed. Running kubectl get pods shows the status of the evicted pods as Shutdown. And kubectl describe pod indicates that the pod was evicted because of node shutdown:

Status: Failed Reason: Shutdown Message: Node is shutting, evicting pods

Failed pod objects will be preserved until explicitly deleted or cleaned up by the GC. This is a change of behavior compared to abrupt node termination.
These pods should eventually be garbage collected, although I'm not sure what the threshold is.
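As far as I can tell, the relevant knob is the kube-controller-manager's terminated-pod GC threshold, which only kicks in once a large number of terminated pods have accumulated; on GKE the control plane is managed, so it isn't something you can tune yourself:

# kube-controller-manager flag (default 12500): terminated pods are only
# garbage collected once this many exist cluster-wide
--terminated-pod-gc-threshold=12500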
After researching various posts, I finally settled on running a cronjob every 9 minutes to avoid the alertManager triggers that fire once pods have been stalled for 10+ minutes. This still feels like a hack to me, but it works, and it forced me to dig into k8s CronJobs and RBAC.
This post put me on the path:
And the resulting cronjob spec:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-accessor-role
  namespace: default
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "delete", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-access
  namespace: default
subjects:
- kind: ServiceAccount
  name: cronjob-sa
  namespace: default
roleRef:
  kind: Role
  name: pod-accessor-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cronjob-sa
  namespace: default
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cron-zombie-killer
  namespace: default
spec:
  schedule: "*/9 * * * *"
  successfulJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        metadata:
          name: cron-zombie-killer
          namespace: default
        spec:
          serviceAccountName: cronjob-sa
          restartPolicy: Never
          containers:
          - name: cron-zombie-killer
            imagePullPolicy: IfNotPresent
            image: bitnami/kubectl
            command:
            - "/bin/sh"
            args:
            - "-c"
            - "kubectl get pods -n default --field-selector='status.phase==Failed' -o name | xargs kubectl delete -n default 2> /dev/null"
status: {}
Note that the stderr redirect to /dev/null is simply to suppress the error output from kubectl delete when kubectl get finds no pods in the Failed state.
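The same selector can also be exercised by hand to sanity-check things, and a field selector on the delete itself avoids the empty-input error without the redirect (a sketch; I haven't verified it against every kubectl version):

# List any zombie pods left behind by a preemption
kubectl get pods -n default --field-selector=status.phase=Failed

# Delete them directly with a field selector instead of piping through xargs
kubectl delete pods -n default --field-selector=status.phase=Failed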
Update: Added the missing "delete" verb to the role, and added the missing RoleBinding.
Update: Added imagePullPolicy.