我们如何针对 Spark 作业的不同阶段优化 CPU/core/executor？

Question

如下图所示：

我的 Spark 作业分为三个阶段：

0. groupBy
1. repartition
2. collect

第 0 阶段和第 1 阶段非常轻量级，但是第 2 阶段CPU 非常密集。

是否可以针对一个 Spark 作业的不同阶段进行不同的配置？

我考虑过将这个 Spark 作业分成两个子作业，但这违背了使用将所有中间结果存储在内存中的 Spark 的目的。这也将显着延长我们的工作时间。

有什么想法吗？

Answer 1

不，无法在运行时更改 spark 配置。请参阅 documentation 以获取 SparkConf：

Note that once a SparkConf object is passed to Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime.

不过，如果中间没有其他操作，我猜你不需要在 collect 之前执行 repartition。 repartition 将在节点上移动数据，如果你想做的是 collect 将它们移动到驱动程序节点上，这是不必要的。

Answer 2

我同意 Shaido 的观点。但是想在这里包括 Spark 2.x 带有称为动态资源分配的东西。

https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation

At a high level, Spark should relinquish executors when they are no longer used and acquire executors when they are needed.

这意味着应用程序可以动态更改值而不是使用 spark.executor.instances

spark.executor.instances is incompatible with spark.dynamicAllocation.enabled. If both spark.dynamicAllocation.enabled and spark.executor.instances are specified, dynamic allocation is turned off and the specified number of spark.executor.instances is used.

我们如何针对 Spark 作业的不同阶段优化 CPU/core/executor？

How can we optimize CPU/core/executor for different stages of a Spark job?

performance

distributed-computing

apache-spark