Spark job on Kubernetes - executors get terminated

We run Spark jobs with the Spark Operator on Kubernetes (EKS, not EMR). After some time, some of the executors receive a TERM signal. Sample log from an executor:

Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.

On the driver side, after 2 minutes without heartbeats the driver decides to kill the executors (driver log below; a sketch of the relevant timeout settings follows the log excerpt):

Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
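For reference, the 120000 ms threshold in the last two lines is, as far as I understand, the heartbeat timeout, which defaults to spark.network.timeout (120s), while executors send heartbeats every spark.executor.heartbeatInterval. A minimal PySpark sketch of where those knobs live; the values are illustrative, and raising them obviously would not help if the executor process is already gone:

from pyspark.sql import SparkSession

# Illustrative values only: the driver drops executors whose heartbeats are older than
# the network timeout; the heartbeat interval must stay well below that timeout.
spark = (
    SparkSession.builder
    .appName("heartbeat-timeout-sketch")
    .config("spark.network.timeout", "240s")
    .config("spark.executor.heartbeatInterval", "20s")
    .getOrCreate()
)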

I am trying to work out whether we are hitting some resource limit at the Kubernetes level, but I could not find anything like that. What can I look for to understand why Kubernetes is killing the executors?
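For context, this is roughly how I have been inspecting the killed executor pods so far, using the kubernetes Python client; the namespace and pod name are placeholders:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

namespace = "spark-jobs"       # placeholder
pod_name = "my-app-exec-37"    # placeholder executor pod name

# 1) Did the container get OOMKilled or evicted? Check its last terminated state.
pod = v1.read_namespaced_pod(pod_name, namespace)
for cs in pod.status.container_statuses or []:
    if cs.last_state and cs.last_state.terminated:
        term = cs.last_state.terminated
        print(cs.name, "exit code:", term.exit_code, "reason:", term.reason)

# 2) Any eviction or kill events recorded against the pod?
events = v1.list_namespaced_event(
    namespace, field_selector="involvedObject.name=" + pod_name
)
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)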

Follow-up:

I had missed a log message on the driver side:

Mar 01 21:04:23.471 Disabling executor 50.

Then on the executor side:

Mar 01 21:04:23.348 RECEIVED SIGNAL TERM

I looked for the class that writes the "Disabling executor" log message and found KubernetesDriverEndpoint. It seems that it calls the onDisconnected method for each of these executors, and that method in DriverEndpoint calls disableExecutor. So the question now is why these executors are considered disconnected. The explanation on this site https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback says:

Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

But I cannot find any WARN messages on the driver side. Any suggestions?
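One thing I can still try is to look at events on the node that hosted the executor rather than on the pod itself, in case the node was being cordoned or drained. A rough sketch with the kubernetes Python client; the node name is a placeholder:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder: the node that ran executor 37 (172.16.52.23 in the driver log above)
node_name = "ip-172-16-52-23.eu-west-1.compute.internal"

# Drain / cordon / preemption notices are usually recorded against the Node object.
events = v1.list_event_for_all_namespaces(
    field_selector="involvedObject.kind=Node,involvedObject.name=" + node_name
)
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)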

The reason the executors were killed is that we were running them on AWS spot instances. This is what it looked like: the first sign of an executor being killed was this line in its log

Feb 27 19:44:10.284 RECEIVED SIGNAL TERM

Once we moved the executors to on-demand instances, not a single executor was terminated during a 20-hour job.
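For anyone hitting the same thing: we fixed it at the node-group level, but Spark also exposes spark.kubernetes.node.selector.* for pinning pods to particular nodes. A sketch, assuming the on-demand nodes carry the EKS managed-node-group label eks.amazonaws.com/capacityType; with the Spark Operator the same selector can go into the SparkApplication spec instead:

from pyspark.sql import SparkSession

# Sketch only: keep the pods off spot capacity by selecting on-demand nodes.
spark = (
    SparkSession.builder
    .appName("on-demand-executors")
    .config("spark.kubernetes.node.selector.eks.amazonaws.com/capacityType", "ON_DEMAND")
    .getOrCreate()
)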