How Spark works when a join is followed by a coalesce
Given that I have two DataFrames, df1 and df2, and I perform a join followed by a coalesce:
df1.join(df2, Seq("id")).coalesce(1)
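For reference, here is a minimal, self-contained version of the setup (df1 and df2 are hypothetical stand-ins; any two DataFrames sharing an "id" column behave the same, and the broadcast threshold is disabled only so this small demo actually uses a SortMergeJoin):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-coalesce")
  .master("local[*]")
  .config("spark.sql.autoBroadcastJoinThreshold", -1L) // force SortMergeJoin for this small demo
  .getOrCreate()
import spark.implicits._

val df1 = (1 to 1000).map(i => (i, s"a$i")).toDF("id", "v1")
val df2 = (1 to 1000).map(i => (i, s"b$i")).toDF("id", "v2")

val joined = df1.join(df2, Seq("id")).coalesce(1)
joined.explain()                     // Coalesce 1 sits directly above the SortMergeJoin: no extra shuffle
println(joined.rdd.getNumPartitions) // 1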
It seems that Spark creates two stages, and the second stage (where the SortMergeJoin takes place) is computed by only one task.
That single task therefore has to hold both entire DataFrames in memory (cf. http://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks).
Can you confirm this?
(I expected the sort to use the spark.sql.shuffle.partitions setting, and a third, additional stage to perform the coalesce; see the sketch below the DAG, which contrasts coalesce(1) with repartition(1).)
cf. the attached DAG.
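For comparison, a sketch (same hypothetical df1/df2 as above) of what repartition(1) does instead: it inserts an Exchange, so the join still runs with spark.sql.shuffle.partitions tasks (200 by default) and a separate stage performs the reduction to a single partition:

val coalesced     = df1.join(df2, Seq("id")).coalesce(1)
val repartitioned = df1.join(df2, Seq("id")).repartition(1)

coalesced.explain()     // Coalesce 1 above SortMergeJoin: the join stage itself runs as 1 task
repartitioned.explain() // Exchange SinglePartition above SortMergeJoin: the join keeps its 200 tasks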
I found confirmation in the book High Performance Spark:
Since tasks are executed on the child partition, the number of tasks executed in a stage that includes a coalesce operation is equivalent to the number of partitions in the result RDD of the coalesce transformation.
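The same rule is easy to see at the RDD level; a minimal sketch, reusing the session from the setup above (the partition counts are arbitrary):

val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 8)
val narrowed = rdd.map(_ * 2).coalesce(2) // narrow dependency: no new stage boundary
println(narrowed.getNumPartitions)        // 2 -> the map + coalesce stage runs only 2 tasks
// coalesce(2, shuffle = true) would insert a shuffle instead, letting the map
// stage keep its original 8 tasks while a second, 2-task stage does the reduction.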