是否可以合并 Spark 分区 "evenly"？

Is it possible to coalesce Spark partitions "evenly"?

假设我们有一个 PySpark 数据帧，数据均匀分布在 2048 个分区中，我们想要合并到 32 个分区以将数据写回 HDFS。使用 coalesce 对此很好，因为它不需要昂贵的洗牌。

但是 coalesce 的缺点之一是它通常会导致新分区中的数据分布不均匀。我假设这是因为原始分区 ID 被散列到新分区 ID space，并且冲突次数是随机的。

但是，原则上应该可以均匀合并，以便将原始数据帧的前 64 个分区发送到新数据帧的第一个分区，接下来的 64 个分区发送到第二个分区，依此类推结束，导致分区均匀分布。生成的数据框通常更适合进一步计算。

这是否可能，同时防止随机播放？

我可以使用中的技巧在初始分区和最终分区之间强制建立我想要的关系，但 Spark 不知道每个原始分区中的所有内容都将转到特定的新分区。因此它无法优化随机播放，而且它的运行速度比 coalesce.

慢得多

在您的情况下，您可以安全地将 2048 个分区合并为 32 个，并假设 Spark 会将上游分区平均分配给合并的分区（在您的情况下每个分区 64 个）。

这里是an extract from the Scaladoc of RDD#coalesce:

This results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.

请考虑您的分区在集群中的物理分布方式也会影响合并发生的方式。以下摘自CoalescedRDD's ScalaDoc:

If there is no locality information (no preferredLocations) in the parent, then the coalescing is very simple: chunk parents that are close in the Array in chunks. If there is locality information, it proceeds to pack them with the following four goals:

(1) Balance the groups so they roughly have the same number of parent partitions

(2) Achieve locality per partition, i.e. find one machine which most parent partitions prefer

(3) Be efficient, i.e. O(n) algorithm for n parent partitions (problem is likely NP-hard)

(4) Balance preferred machines, i.e. avoid as much as possible picking the same preferred machine

是否可以合并 Spark 分区 "evenly"？

Is it possible to coalesce Spark partitions "evenly"?

partitioning

apache-spark

pyspark