Spark job on Kubernetes - executors get terminated
We run Spark jobs on Kubernetes (EKS, not EMR) using the Spark operator.
After some time, some executors receive SIGNAL TERM. Sample log from an executor:
Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.
On the driver side, after 2 minutes the driver stops receiving heartbeats and then decides to kill the executors:
Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
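(The 120000 ms in those messages matches Spark's default spark.network.timeout of 120s, which the driver's heartbeat receiver uses as the executor timeout. If the executors were merely slow rather than gone, the relevant knobs could be widened; a minimal sketch, the values here are purely illustrative, not a recommendation:)

```scala
import org.apache.spark.sql.SparkSession

// Illustrative tuning sketch: widen the heartbeat-related timeouts so the
// driver waits longer before declaring an executor dead.
val spark = SparkSession.builder()
  .appName("heartbeat-tuning-example")
  // How often each executor sends a heartbeat to the driver (default 10s);
  // it should stay well below spark.network.timeout.
  .config("spark.executor.heartbeatInterval", "20s")
  // Default timeout for network interactions, also used as the executor
  // heartbeat timeout (default 120s, i.e. the 120000 ms seen in the logs).
  .config("spark.network.timeout", "300s")
  .getOrCreate()
```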
I was trying to understand whether we are crossing some resource limit at the Kubernetes level, but couldn't find anything like that.
What can I look at to understand why Kubernetes is killing the executors?
Follow-up:
I had missed a log message on the driver side:
Mar 01 21:04:23.471 Disabling executor 50.
and then on the executor side:
Mar 01 21:04:23.348 RECEIVED SIGNAL TERM
I looked at the class that writes the "Disabling executor" log message and found KubernetesDriverEndpoint. It seems the onDisconnected method gets called for all of these executors, and that method calls disableExecutor in DriverEndpoint.
So the question now is why these executors are considered disconnected.
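(For reference, the callback path described above looks roughly like the following self-contained sketch; it is paraphrased, not the actual Spark source, and the addresses are made up:)

```scala
// Paraphrased sketch of the DriverEndpoint logic: the driver keeps a map from
// RPC address to executor id; when an executor's connection drops, the
// matching executor is "disabled", producing the "Disabling executor <id>."
// log line seen above.
object DisconnectSketch {
  case class RpcAddress(host: String, port: Int)

  // Driver-side bookkeeping: which executor is registered at which address
  // (addresses here are purely illustrative).
  val addressToExecutorId: Map[RpcAddress, String] = Map(
    RpcAddress("172.16.52.23", 40435) -> "37",
    RpcAddress("172.16.55.46", 40435) -> "44"
  )

  def disableExecutor(executorId: String): Unit =
    println(s"Disabling executor $executorId.")

  // Invoked when an executor's RPC connection to the driver is lost.
  def onDisconnected(remoteAddress: RpcAddress): Unit =
    addressToExecutorId.get(remoteAddress).foreach(disableExecutor)

  def main(args: Array[String]): Unit =
    onDisconnected(RpcAddress("172.16.52.23", 40435))
}
```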
Looking at the explanation on this site:
https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback
it says:
Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
However, I can't find any WARN logs on the driver side. Any suggestions?
The reason the executors were being killed is that we were running them on AWS spot instances. That's what it turned out to be; the first sign we see of an executor being killed is this line in its log:
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Once we moved the executors to on-demand instances, not a single executor was terminated during a 20-hour job.
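(If you still want to keep some workloads off spot capacity without changing node groups per job, one option is to pin the Spark pods to on-demand nodes via Spark's Kubernetes node-selector configuration. A sketch, assuming EKS managed node groups, which label nodes with eks.amazonaws.com/capacityType = ON_DEMAND or SPOT; adjust the label key and value to whatever your cluster actually uses:)

```scala
import org.apache.spark.sql.SparkSession

// Sketch: steer the Spark-on-Kubernetes driver and executor pods onto
// on-demand nodes. spark.kubernetes.node.selector.<labelKey> adds a
// nodeSelector to the pods Spark creates; the label below is the one EKS
// managed node groups apply (an assumption about your setup).
val spark = SparkSession.builder()
  .appName("on-demand-executors-example")
  .config("spark.kubernetes.node.selector.eks.amazonaws.com/capacityType", "ON_DEMAND")
  .getOrCreate()
```

The same key can equally be set under sparkConf in the SparkApplication spec used by the Spark operator.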