Spark mapPartitionsWithIndex ：标识一个分区

Question

识别一个分区：

mapPartitionsWithIndex(index, iter)

该方法导致将函数驱动到每个分区。我知道我们可以使用 "index" 参数跟踪分区。

许多示例都使用此方法在使用 "index = 0" 条件的数据集中删除 header。但是我们如何确保读取的第一个分区（翻译，"index"参数等于0）确实是header。它是随机的还是基于分区程序（如果使用）。

Answer 1

Isn't it random or based upon the partitioner, if used?

不是随机的，而是partitioner的编号。您可以通过下面提到的简单示例来理解它

val base = sc.parallelize(1 to 100, 4)    
base.mapPartitionsWithIndex((index, iterator) => {

  iterator.map { x => (index, x) }

}).foreach { x => println(x) }

Result : (0,1) (1,26) (2,51) (1,27) (0,2) (0,3) (0,4) (1,28) (2,52) (1,29) (0,5) (1,30) (1,31) (2,53) (1,32) (0,6) ... ...

Spark mapPartitionsWithIndex ：标识一个分区

Spark mapPartitionsWithIndex : Identify a partition

scala

hadoop-partitioning

apache-spark

rdd