How does Spark decide how to partition an RDD?

Question

Suppose I create an RDD like this (I am using PySpark):

list_rdd = sc.parallelize(xrange(0, 20, 2), 6)

Then I print the partition contents with the glom() method and obtain

[[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]

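How does Spark decide how to partition my list? Where does this specific choice of elements come from? It could have paired them differently, leaving elements other than 0 and 10 on their own, and still produced the 6 requested partitions. On a second run, the partitions are the same.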

Using a larger range, with 15 elements (0 through 28), I get partitions that alternate between 2 elements and 3 elements:

list_rdd = sc.parallelize(xrange(0, 30, 2), 6)
[[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]

Using a smaller range, with 5 elements (0 through 8), I get

list_rdd = sc.parallelize(xrange(0, 10, 2), 6)
[[], [0], [2], [4], [6], [8]]

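So what I infer is that Spark generates the partitions by splitting the list into a configuration where the smallest possible group is followed by a larger one, and the pattern repeats.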

The question is whether there is a reason behind this choice. It is very elegant, but does it also provide a performance advantage?

Answer 1

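As long as you don't specify a particular partitioner, the layout is "random" in the sense that it depends on the specific implementation of that RDD. It is still deterministic, which is why a second run gives the same partitions. In this case you can head to ParallelCollectionRDD to dig into it further.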

getPartitions is defined as:

val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray

where slice is documented as (reformatted to fit better):

/**
* Slice a collection into numSlices sub-collections. 
* One extra thing we do here is to treat Range collections specially, 
* encoding the slices as other Ranges to minimize memory cost. 
* This makes it efficient to run Spark over RDDs representing large sets of numbers. 
* And if the collection is an inclusive Range, 
* we use inclusive range for the last slice.
*/

Note that there are some considerations with regard to memory. So, again, this is going to be specific to the implementation.
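
The boundaries seen in the question can be reproduced with plain integer arithmetic. The sketch below assumes that slice i covers the indices from floor(i * n / numSlices) up to (but not including) floor((i + 1) * n / numSlices), which is my reading of how slice computes its positions. The helper names slice_positions and slice_collection are made up for illustration and are not part of the Spark API. It is written for Python 3, so it uses range where the question uses xrange.

```python
def slice_positions(length, num_slices):
    """Yield one (start, end) index pair per slice, using floor division."""
    for i in range(num_slices):
        start = (i * length) // num_slices
        end = ((i + 1) * length) // num_slices
        yield start, end


def slice_collection(data, num_slices):
    """Split a sequence into num_slices contiguous sub-lists (hypothetical helper)."""
    data = list(data)
    return [data[start:end] for start, end in slice_positions(len(data), num_slices)]


# The three ranges from the question, each split into 6 slices.
print(slice_collection(range(0, 20, 2), 6))
print(slice_collection(range(0, 30, 2), 6))
print(slice_collection(range(0, 10, 2), 6))
```

Under that assumption the three printed lists match the three glom() results in the question. The alternating small/large pattern is then just a side effect of rounding the boundaries down: with 15 elements and 6 slices the ideal slice size is 2.5, so the boundaries land at 0, 2, 5, 7, 10, 12, 15.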
