Spark job on Kubernetes - executors get terminated

We run Spark jobs with the Spark Operator on Kubernetes (EKS, not EMR). After some time, some of the executors receive a TERM signal. Sample log from an executor:

Feb 27 19:44:10.447 s3a-file-system metrics system stopped.
Feb 27 19:44:10.446 Stopping s3a-file-system metrics system...
Feb 27 19:44:10.329 Deleting directory /var/data/spark-05983610-6e9c-4159-a224-0d75fef2dafc/spark-8a21ea7e-bdca-4ade-9fb6-d4fe7ef5530f
Feb 27 19:44:10.328 Shutdown hook called
Feb 27 19:44:10.321 BlockManager stopped
Feb 27 19:44:10.319 MemoryStore cleared
Feb 27 19:44:10.284 RECEIVED SIGNAL TERM
Feb 27 19:44:10.169 block read in memory in 306 ms. row count = 113970
Feb 27 19:44:09.863 at row 0. reading next block
Feb 27 19:44:09.860 RecordReader initialized will read a total of 113970 records.

On the driver side, after 2 minutes without heartbeats the driver decides to kill the executors (driver log below; a sketch of the relevant timeout settings follows the log excerpt):

Feb 27 19:46:12.155 Asked to remove non-existent executor 37
Feb 27 19:46:12.155 Removal of executor 37 requested
Feb 27 19:46:12.155 Trying to remove executor 37 from BlockManagerMaster.
Feb 27 19:46:12.154 task 2463.0 in stage 0.0 (TID 2463) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.154 Executor 37 on 172.16.52.23 killed by driver.
Feb 27 19:46:12.153 Trying to remove executor 44 from BlockManagerMaster.
Feb 27 19:46:12.153 Asked to remove non-existent executor 44
Feb 27 19:46:12.153 Removal of executor 44 requested
Feb 27 19:46:12.153 Actual list of executor(s) to be killed is 37
Feb 27 19:46:12.152 task 2595.0 in stage 0.0 (TID 2595) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
Feb 27 19:46:12.152 Executor 44 on 172.16.55.46 killed by driver.
Feb 27 19:46:12.152 Requesting to kill executor(s) 37
Feb 27 19:46:12.151 Actual list of executor(s) to be killed is 44
Feb 27 19:46:12.151 Requesting to kill executor(s) 44
Feb 27 19:46:12.151 Removing executor 37 with no recent heartbeats: 160277 ms exceeds timeout 120000 ms
Feb 27 19:46:12.151 Removing executor 44 with no recent heartbeats: 122513 ms exceeds timeout 120000 ms
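For reference, the 120000 ms threshold in the last two lines is, as far as I understand, the heartbeat timeout, which defaults to spark.network.timeout (120s), while executors send heartbeats every spark.executor.heartbeatInterval. A minimal PySpark sketch of where those knobs live; the values are illustrative, and raising them obviously would not help if the executor process is already gone:

from pyspark.sql import SparkSession

# Illustrative values only: the driver drops executors whose heartbeats are older than
# the network timeout; the heartbeat interval must stay well below that timeout.
spark = (
    SparkSession.builder
    .appName("heartbeat-timeout-sketch")
    .config("spark.network.timeout", "240s")
    .config("spark.executor.heartbeatInterval", "20s")
    .getOrCreate()
)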

I am trying to work out whether we are hitting some resource limit at the Kubernetes level, but I could not find anything like that. What can I look for to understand why Kubernetes is killing the executors?
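For context, this is roughly how I have been inspecting the killed executor pods so far, using the kubernetes Python client; the namespace and pod name are placeholders:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

namespace = "spark-jobs"       # placeholder
pod_name = "my-app-exec-37"    # placeholder executor pod name

# 1) Did the container get OOMKilled or evicted? Check its last terminated state.
pod = v1.read_namespaced_pod(pod_name, namespace)
for cs in pod.status.container_statuses or []:
    if cs.last_state and cs.last_state.terminated:
        term = cs.last_state.terminated
        print(cs.name, "exit code:", term.exit_code, "reason:", term.reason)

# 2) Any eviction or kill events recorded against the pod?
events = v1.list_namespaced_event(
    namespace, field_selector="involvedObject.name=" + pod_name
)
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)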

Follow-up:

I had missed a log message on the driver side:

Mar 01 21:04:23.471 Disabling executor 50.

Then on the executor side:

Mar 01 21:04:23.348 RECEIVED SIGNAL TERM

I looked for the class that writes the "Disabling executor" log message and found KubernetesDriverEndpoint. It seems that it calls the onDisconnected method for each of these executors, and that method in DriverEndpoint calls disableExecutor. So the question now is why these executors are considered disconnected. The explanation on this site https://books.japila.pl/apache-spark-internals/scheduler/DriverEndpoint/#ondisconnected-callback says:

Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

But I cannot find any WARN messages on the driver side. Any suggestions?
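One thing I can still try is to look at events on the node that hosted the executor rather than on the pod itself, in case the node was being cordoned or drained. A rough sketch with the kubernetes Python client; the node name is a placeholder:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder: the node that ran executor 37 (172.16.52.23 in the driver log above)
node_name = "ip-172-16-52-23.eu-west-1.compute.internal"

# Drain / cordon / preemption notices are usually recorded against the Node object.
events = v1.list_event_for_all_namespaces(
    field_selector="involvedObject.kind=Node,involvedObject.name=" + node_name
)
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)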

The reason the executors were killed is that we were running them on AWS spot instances. This is what it looked like: the first sign of an executor being killed was this line in its log

Feb 27 19:44:10.284 RECEIVED SIGNAL TERM

Once we moved the executors to on-demand instances, not a single executor was terminated during a 20-hour job.
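For anyone hitting the same thing: we fixed it at the node-group level, but Spark also exposes spark.kubernetes.node.selector.* for pinning pods to particular nodes. A sketch, assuming the on-demand nodes carry the EKS managed-node-group label eks.amazonaws.com/capacityType; with the Spark Operator the same selector can go into the SparkApplication spec instead:

from pyspark.sql import SparkSession

# Sketch only: keep the pods off spot capacity by selecting on-demand nodes.
spark = (
    SparkSession.builder
    .appName("on-demand-executors")
    .config("spark.kubernetes.node.selector.eks.amazonaws.com/capacityType", "ON_DEMAND")
    .getOrCreate()
)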