当连接后跟合并时 spark 是如何工作的

Question

鉴于我有 2 DataFrames df1 和 df2

我执行 join 然后执行 coalesce

df1.join(df2, Seq("id")).coalesce(1)

似乎 Spark 创建了 2 个阶段，而第二个阶段（发生 SortMergeJoin 的地方）仅由一个任务计算。

所以这个独特的任务需要将整个数据帧都存储在内存中（cf : http://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks）。

你能确认一下吗？

（我原以为排序会使用 spark.sql.shuffle.partitions 设置，而第三个附加阶段会执行合并）。

cf DAG

Answer 1

我在书中找到了确认 High Performance Spark。

Since tasks are executed on the child partition, the number of tasks executed in a stage that includes a coalesce operation is equivalent to the number of partitions in the result RDD of the coalesce transformation.

当连接后跟合并时 spark 是如何工作的

How spark works when a join is followed by a coalesce

apache-spark

spark-dataframe