What is the relationship between buckets and partitions?

Question

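Is there a relationship between the partitions of an RDD and the buckets that the contents of the RDD get mapped to before a shuffle operation?

Secondly, will all key-value pairs with the same key be shuffled to the same bucket, or is the distribution of key-value pairs to buckets random? Does specifying a partitioner (hash/range) have any effect on this distribution?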

Answer 1

Is there a relationship between the Partitions of an RDD and the Buckets which the contents of the RDD get mapped to before a shuffle operation?

If you are asking about bucketed tables (after bucketBy and spark.table("bucketed_table")), I think the answer is yes.

Let me show you what I mean by answering "yes".

scala> val large = spark.range(1000000)
scala> println(large.queryExecution.toRdd.getNumPartitions)
8

scala> large.write.bucketBy(4, "id").saveAsTable("bucketed_4_id")
18/04/18 22:00:58 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`bucketed_4_id` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

scala> println(spark.table("bucketed_4_id").queryExecution.toRdd.getNumPartitions)
4

In other words, the number of partitions (after loading a bucketed table) is exactly the number of buckets (which you defined when saving).

Secondly, will all key value pairs with same key be shuffled to the same bucket or is the distribution of key value pairs to buckets random?

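Spark 2.3 (and I believe earlier versions work similarly) does the bucketing per partition (per writer task), i.e. every partition produces the number of buckets you defined.

In the case above, you end up with 8 (partitions) x 4 (buckets) = 32 bucket files. Two extra lines, one for _SUCCESS and one for the header line of the listing, bring the count to 34.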

$ ls -ltr spark-warehouse/bucketed_4_id | wc -l
      34
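
The 8 x 4 = 32 arithmetic can be reproduced with a toy model in plain Scala. This is only a sketch under stated assumptions, not Spark's writer: `bucketOf` is a hypothetical stand-in that uses plain modulo instead of Spark's real bucket hash, and `slices` assumes the ids are split into contiguous ranges, one per partition. The only idea it models is that each writer task emits one file per distinct bucket it finds in its own partition.

```scala
// Toy model of per-partition bucketing; all names here are hypothetical.
object BucketFileCount {
  // Assumption: plain modulo stands in for the real bucket hash function.
  def bucketOf(id: Long, numBuckets: Int): Int = (id % numBuckets).toInt

  // Assumption: ids [0, n) are split into contiguous slices, one per partition.
  def slices(n: Long, numPartitions: Int): Seq[(Long, Long)] =
    (0 until numPartitions).map { i =>
      (i * n / numPartitions, (i + 1) * n / numPartitions)
    }

  // Each writer task emits one file per distinct bucket present in its slice,
  // so the total is the sum of distinct bucket counts over all partitions.
  def filesWritten(n: Long, numPartitions: Int, numBuckets: Int): Int =
    slices(n, numPartitions).map { case (start, end) =>
      (start until end).map(bucketOf(_, numBuckets)).toSet.size
    }.sum

  def main(args: Array[String]): Unit =
    println(filesWritten(1000000L, 8, 4)) // prints 32
}
```

In this model, a partition that only held ids from a single bucket would contribute one file instead of four, which is why the file count depends on how the data is partitioned before the write.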

Does specifying a partitioner (hash/range) have any effect on this distribution?

I think so, because a partitioner is what distributes data across partitions.
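
To illustrate why the choice of partitioner matters, here is a minimal plain-Scala sketch of the two assignment schemes. It is modelled on the usual idea behind hash partitioning (hash code modulo the number of partitions) and range partitioning (sorted upper bounds); the object and function names are hypothetical and the real Spark classes do more work, such as sampling the data to choose range bounds. In both schemes the target depends only on the key, so equal keys always land in the same partition rather than being spread randomly.

```scala
// Sketch only: hypothetical helpers, not Spark's partitioner classes.
object PartitionerSketch {
  // Non-negative modulo, so negative hash codes still map into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Hash-style assignment: the partition is a pure function of the key.
  def hashPartition(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // Range-style assignment: upperBounds must be sorted ascending; a key goes
  // to the first range whose upper bound is >= key, else to the last partition.
  def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
    val idx = upperBounds.indexWhere(key <= _)
    if (idx >= 0) idx else upperBounds.length
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("a", "b", "a", "c", "b")
    // Repeated keys print the same partition number every time.
    keys.foreach(k => println(s"$k -> ${hashPartition(k, 4)}"))
  }
}
```

The hash scheme scatters neighbouring keys across partitions, while the range scheme keeps keys that sort close together in the same partition, so the two give different layouts for the same data.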

apache-spark

apache-spark-sql