当连接后跟合并时 spark 是如何工作的

How spark works when a join is followed by a coalesce

鉴于我有 2 DataFrames df1df2

我执行 join 然后执行 coalesce

df1.join(df2, Seq("id")).coalesce(1)

似乎 Spark 创建了 2 个阶段,而第二个阶段(发生 SortMergeJoin 的地方)仅由一个任务计算。

所以这个独特的任务需要将整个数据帧都存储在内存中(cf : http://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks)。

你能确认一下吗?

(我原以为排序会使用 spark.sql.shuffle.partitions 设置,而第三个附加阶段会执行合并)。

cf DAG

我在书中找到了确认 High Performance Spark

Since tasks are executed on the child partition, the number of tasks executed in a stage that includes a coalesce operation is equivalent to the number of partitions in the result RDD of the coalesce transformation.