在 Spark 中，当没有指定分区器时，ReduceByKey 操作是否会在开始聚合之前按哈希重新分区数据？

Question

如果我们没有提及任何用于 reduceByKey 操作的分区器，它是否在 reduce 之前在内部执行 hashPartitioning？例如我的测试代码是这样的：

val rdd = sc.parallelize(Seq((5, 1), (10, 2), (15, 3), (5, 4), (5, 1), (5,3), (5,9), (5,6)), 5)
val newRdd = rdd.reduceByKey((a,b) => (a+b))

这里，reduceByKey操作是否将具有相同键的所有记录带到同一个分区并执行归约（对于上面没有提到分区程序的代码）？由于我的用例有倾斜数据（都具有相同的键），如果将所有记录都放在一个分区中，可能会导致 out of memory 错误。因此，在所有分区上均匀分布记录适合此处的用例。

Answer 1

事实上，使用 reduceByKey 而不是 groupByKey 的最大优势在于 spark 会在随机播放之前（即在重新分区任何内容之前）组合同一分区上的键。因此，由于使用 reduceByKey.

的数据偏斜，出现内存问题的可能性很小

有关更多详细信息，您可能需要从比较 reduceByKey 与 groupByKey 的数据块中阅读 this post。特别是，他们这样说：

While both of these functions will produce the correct answer, the reduceByKey example works much better on a large dataset. That's because Spark knows it can combine output with a common key on each partition before shuffling the data.

在 Spark 中，当没有指定分区器时，ReduceByKey 操作是否会在开始聚合之前按哈希重新分区数据？

In Spark, when no partitioner is specified, does the ReduceByKey operation repartition the data by hash before starting aggregating it?

scala

data-partitioning

apache-spark

rdd