Sudden pod restart of a Kubernetes deployment - what is the reason?
I have microservices deployed on GKE using Helm v3; all apps/charts were fine for months, but yesterday for some reason the pods were recreated:
kubectl get pods -l app=myapp
NAME                     READY   STATUS    RESTARTS   AGE
myapp-75cb966746-grjkj   1/1     Running   1          14h
myapp-75cb966746-gz7g7   1/1     Running   0          14h
myapp-75cb966746-nmzzx   1/1     Running   1          14h
helm3 history myapp
shows it was updated 2 days ago (40+ hours), not yesterday, so I rule out the possibility that someone simply ran helm3 upgrade ..; it rather looks as if someone ran kubectl rollout restart deployment/myapp
Any ideas how I can check why the pods were restarted? I am not sure how to verify it. PS: the logs from the pods only go back about 3 hours.
FYI, I am not asking about kubectl logs -p myapp-75cb966746-grjkj
 - that part is fine. I want to know what happened to the 3 pods that existed until ~14 hours ago and were simply deleted/replaced, and how to check that.
There are also no events on the cluster:
MacBook-Pro% kubectl get events
No resources found in myns namespace.
As for describing the deployment: first, the deployment was created months ago
CreationTimestamp: Thu, 22 Oct 2020 09:19:39 +0200
and the last update was more than 40 hours ago
lastUpdate: 2021-04-07 07:10:09.715630534 +0200 CEST m=+1.867748121
Here is the full describe output, in case anyone wants it:
MacBook-Pro% kubectl describe deployment myapp
Name: myapp
Namespace: myns
CreationTimestamp: Thu, 22 Oct 2020 09:19:39 +0200
Labels: app=myapp
Annotations: deployment.kubernetes.io/revision: 42
lastUpdate: 2021-04-07 07:10:09.715630534 +0200 CEST m=+1.867748121
meta.helm.sh/release-name: myapp
meta.helm.sh/release-namespace: myns
Selector: app=myapp,env=myns
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 5
RollingUpdateStrategy: 25% max unavailable, 1 max surge
Pod Template:
  Labels:       app=myapp
  Annotations:  kubectl.kubernetes.io/restartedAt: 2020-10-23T11:21:11+02:00
  Containers:
   myapp:
    Image:      xxx
    Port:       8080/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     1
      memory:  1G
    Requests:
      cpu:     1
      memory:  1G
    Liveness:   http-get http://:myappport/status delay=45s timeout=5s period=10s #success=1 #failure=3
    Readiness:  http-get http://:myappport/status delay=45s timeout=5s period=10s #success=1 #failure=3
    Environment Variables from:
      myapp-myns    Secret  Optional: false
    Environment:
      myenv: myval
    Mounts:
      /some/path from myvol (ro)
  Volumes:
   myvol:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      myvol
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Progressing    True    NewReplicaSetAvailable
  Available      True    MinimumReplicasAvailable
OldReplicaSets:  <none>
NewReplicaSet:   myapp-75cb966746 (3/3 replicas created)
Events:          <none>
You can use
kubectl describe pod your_pod_name
In Containers.your_container_name.lastState you get the time and the reason why the container was last terminated (for example due to an error, or because it was OOMKilled).
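If you prefer the raw API fields over the describe output, the same information can be read with jsonpath; this is just a sketch, the pod name is taken from your output and the container index 0 is an assumption:
# reason, exit code and timestamps of the last termination of the first container
kubectl get pod myapp-75cb966746-grjkj -o jsonpath='{.status.containerStatuses[0].lastState}'
# in-place restart count of that container
kubectl get pod myapp-75cb966746-grjkj -o jsonpath='{.status.containerStatuses[0].restartCount}'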
Documentation reference:
kubectl explain pod.status.containerStatuses.lastState
KIND: Pod
VERSION: v1
RESOURCE: lastState <Object>
DESCRIPTION:
Details about the container's last termination condition.
ContainerState holds a possible state of container. Only one of its members
may be specified. If none of them is specified, the default one is
ContainerStateWaiting.
FIELDS:
running <Object>
Details about a running container
terminated <Object>
Details about a terminated container
waiting <Object>
Details about a waiting container
An example from one of my containers, which was terminated because of an application error:
Containers:
  my_container:
    Last State:   Terminated
      Reason:     Error
      Exit Code:  137
      Started:    Tue, 06 Apr 2021 16:28:57 +0300
      Finished:   Tue, 06 Apr 2021 16:32:07 +0300
To get the logs from before the container was restarted, you can use the --previous flag, like this:
kubectl logs your_pod_name --previous
I would suggest you run kubectl describe deployment <deployment-name>
and kubectl describe pod <pod-name>
Also, kubectl get events
will show cluster-level events and may help you understand what happened.
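Keep in mind that events are only retained for a short time (about an hour by default), so an empty kubectl get events does not prove nothing happened. On GKE you can additionally query the admin-activity audit logs, which are kept much longer, to see whether anyone patched, scaled or restarted the Deployment. A rough sketch, assuming Cloud Audit Logs are available on the cluster and that the filter values below match your setup:
# API calls that touched Deployments named like "myapp" in the last 2 days
gcloud logging read '
  resource.type="k8s_cluster"
  protoPayload.methodName:"deployments"
  protoPayload.resourceName:"myapp"
' --freshness=2d --limit=20 \
  --format='table(timestamp, protoPayload.authenticationInfo.principalEmail, protoPayload.methodName)'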
First of all, I would check the nodes on which the Pods were running.
- If a Pod is restarted (meaning its RESTART COUNT is incremented), it usually means the Pod had an error and that error caused it to crash.
- In your case, though, the Pods were recreated entirely, which means (as you said) someone could have run a rollout restart, or the deployment was scaled down and then up again (both manual operations); a quick way to tell these cases apart is sketched below.
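One way to distinguish those cases is to look at the ReplicaSets: a rollout restart or a helm upgrade creates a new ReplicaSet (and a new pod-template hash in the pod names), while Pods that were evicted or deleted are recreated by the existing ReplicaSet with the same hash (here 75cb966746). A minimal check, reusing the label from the Deployment above:
# the age of the newest ReplicaSet tells you when the last real rollout happened
kubectl get rs -l app=myapp
# which ReplicaSet owns the current Pods, and when each Pod was created
kubectl get pods -l app=myapp -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name,CREATED:.metadata.creationTimestamp'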
The most common case of Pods being recreated automatically is a problem with the node the Pods were running on. If a node becomes NotReady, even for a short time, the Kubernetes scheduler will try to schedule new Pods on other nodes to match the desired state (number of replicas, and so on).
Old Pods on the NotReady node go into Terminating state and are removed once the NotReady node becomes Ready again (if they are still up and running on it).
This is described in detail in the documentation (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-lifetime):
If a Node dies, the Pods scheduled to that node are scheduled for deletion after a timeout period. Pods do not, by themselves, self-heal. If a Pod is scheduled to a node that then fails, the Pod is deleted; likewise, a Pod won't survive an eviction due to a lack of resources or Node maintenance. Kubernetes uses a higher-level abstraction, called a controller, that handles the work of managing the relatively disposable Pod instances.
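To check the node angle, look at the nodes the Pods were scheduled on and at recent GKE operations; node-pool auto-upgrades and auto-repairs drain nodes and cause exactly this kind of silent Pod recreation. A sketch, with <node-name> as a placeholder:
# node age and status; nodes younger than 14h point to an upgrade, repair or preemption
kubectl get nodes
# conditions and recent events of the node a Pod was running on
kubectl describe node <node-name>
# GKE-side record of cluster/node-pool operations; look for UPGRADE_NODES or AUTO_REPAIR_NODES
# entries around the time the Pods were recreated
gcloud container operations list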