Kubernetes Pods Terminated - Exit Code 137
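I need some advice on an issue I am facing with k8s 1.14 while running GitLab pipelines on it. Many jobs are failing with exit code 137, which I found means that the container is being terminated abruptly.
Cluster information:
Kubernetes version: 1.14
Cloud being used: AWS EKS
Node: C5.4xLarge
After digging in, I found the following logs: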
**kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
**kubelet: E0114 03:37:08.653132** 4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes
**kubelet: W0114 03:37:23.240990** 4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up
**kubelet: W0114 00:15:51.106881** 4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.106907** 4781 container_gc.go:85] attempting to delete unused containers
**kubelet: I0114 00:15:51.116286** 4781 image_gc_manager.go:317] attempting to delete unused images
**kubelet: I0114 00:15:51.130499** 4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
**kubelet: I0114 00:15:51.130648** 4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:
1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)
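After that the pods get terminated, resulting in exit code 137.
Can anyone help me understand the reason and a possible solution to overcome this?
Thank you :)
I was able to solve the problem.
The nodes initially had a 20G EBS volume on a c5.4xlarge instance type. I increased the EBS volume to 50G and then 100G, but that did not help, as I kept seeing the error below: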
"Disk usage on image filesystem is at 95% which is over the high
threshold (85%). Trying to free 3022784921 bytes down to the low
threshold (80%). "
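I then changed the instance type to c5d.4xlarge, which had 400GB of cache storage, and gave it 300GB of EBS. This resolved the error.
Some of the GitLab jobs were for Java applications that were consuming a lot of cache space and writing a lot of logs.
Exit code 137 does not necessarily mean OOMKilled. It indicates failure because the container received SIGKILL (some interrupt, or the "oom-killer" [out of memory]).
If the pod got OOMKilled, you will see the lines below when you describe the pod: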
State: Terminated
Reason: OOMKilled
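To check the same thing programmatically, here is a minimal Python sketch that reads a pod object (for example the JSON printed by `kubectl get pod <name> -o json`) and lists why the pod or its containers stopped. The helper name `termination_reasons` and both sample pods are made up for illustration; only the `status.reason`, `containerStatuses[].state.terminated` and `lastState.terminated` fields come from the Kubernetes pod status.

```python
from typing import Dict, List


def termination_reasons(pod: Dict) -> List[str]:
    """Collect the reasons recorded for a stopped pod and its containers."""
    status = pod.get("status", {})
    findings = []
    # An evicted pod records the reason at the pod level, not per container.
    if status.get("reason"):
        findings.append(f"pod: {status['reason']}")
    for cs in status.get("containerStatuses", []):
        # "state" is the current state, "lastState" the previous run.
        for key in ("state", "lastState"):
            term = cs.get(key, {}).get("terminated")
            if term:
                findings.append(
                    f"{cs['name']} ({key}): {term.get('reason', 'unknown')}, "
                    f"exit code {term.get('exitCode')}"
                )
    return findings


# Hypothetical samples, not taken from the cluster in the question.
oom_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "build",
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                "lastState": {},
            }
        ]
    }
}
evicted_pod = {"status": {"phase": "Failed", "reason": "Evicted"}}

print(termination_reasons(oom_pod))
print(termination_reasons(evicted_pod))
```

If the first list mentions OOMKilled, memory is the problem. If the pod-level reason is Evicted, as in the second sample, look at node disk or memory pressure instead.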
Edited on 2/2/2022
I see that you added **kubelet: I0114 03:37:08.639450** 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
and "must evict pod(s) to reclaim ephemeral-storage" from the logs. This usually happens when application pods are writing something to disk, such as log files. Admins can configure when (at what disk usage %) eviction happens.
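To make the percentages in that log line concrete, here is a rough Python sketch of the arithmetic as I understand the kubelet image GC to do it: garbage collection starts once usage reaches the high threshold and tries to free enough bytes to get back to the low threshold. The function name `image_gc_bytes_to_free` is made up, and the capacity and available figures are assumptions picked so that they reproduce the numbers in the log, not values measured on the node. The 85 and 80 defaults are the thresholds shown in the log.

```python
def image_gc_bytes_to_free(capacity: int, available: int,
                           high_pct: int = 85, low_pct: int = 80) -> int:
    """Bytes image GC would try to free, or 0 if usage is below high_pct."""
    # Usage as a whole-number percentage of the image filesystem.
    usage_pct = 100 - (available * 100 // capacity)
    if usage_pct < high_pct:
        return 0
    # Free enough so that (100 - low_pct)% of the capacity is available again.
    return capacity * (100 - low_pct) // 100 - available


# Assumed figures: a 20 GiB image filesystem with about 1.27 GB left.
capacity = 20 * 1024**3
available = 1272182375
print(image_gc_bytes_to_free(capacity, available))
```

With these assumed figures the result matches the 3022784921 bytes in the log, which fits a disk of roughly 20G that is almost full. If the images on the node are all in use, GC has nothing to delete ("freed 0 bytes") and the eviction manager starts evicting pods instead.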
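The typical causes for this error code can be the system running out of RAM, or a failed health check.
137 means that k8s killed the container for some reason (it may be that it didn't pass a liveness probe).
Code 137 is 128 + 9 (SIGKILL): the process was killed by an external signal.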
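The 128 + N arithmetic can be checked with a few lines of Python. This is only a sketch of the usual shell convention that exit codes above 128 encode a signal number; the helper name `describe_exit_code` is made up for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short description."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # on Linux: killed by SIGKILL
print(describe_exit_code(143))  # on Linux: killed by SIGTERM
```

The code only tells you which signal arrived, not who sent it, so you still need the pod status or kubelet logs to tell an OOM kill from an eviction or a failed probe.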
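Check the Jenkins master node's memory and CPU profile. In my case, it was the master that was under high memory and CPU utilization, and the slaves were getting restarted with 137.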