Does Spark benefit from `sortBy` in a persistent table?

Question

Spark v2.4, without Hive.

Spark benefits from bucketBy because it knows the DataFrame already has the right partitioning. What about sortBy?

spark.range(100, numPartitions=1).write.bucketBy(3, 'id').sortBy('id').saveAsTable('df')

# No need to `repartition`.
spark.table('df').repartition(3, 'id').explain()
# == Physical Plan ==
# *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3

# Still need to `sortWithinPartitions`.
spark.table('df').sortWithinPartitions('id').explain()
# == Physical Plan ==
# *(1) Sort [id#33620L ASC NULLS FIRST], false, 0
# +- *(1) FileScan parquet default.df2[id#33620L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 3 out of 3

So the extra repartition is elided from the plan, but the sortWithinPartitions is not. Is sortBy useful at all? Can we use sortBy to speed up table joins?

Answer 1

Short answer: sortBy on a persistent table gives no benefit (at least for now).

Longer answer:

Spark and Hive do not implement bucketing with the same semantics, even though Spark can save a bucketed DataFrame to a Hive table.

First, the unit of storage differs between the two frameworks: a single file per bucket (Hive) versus a collection of files per bucket (Spark).
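The layout difference can be pictured with a small pure-Python simulation. This is only a sketch, not Spark's writer: `bucket_of` uses a plain modulo as a stand-in for the real bucket hash, and `spark_style_layout` and the task split are hypothetical names chosen for illustration.

```python
from collections import defaultdict

NUM_BUCKETS = 3

def bucket_of(key: int) -> int:
    # Assumption: plain modulo stands in for the real bucket hash.
    return key % NUM_BUCKETS

def spark_style_layout(tasks):
    """Simulate each writing task emitting its own file per bucket."""
    files = defaultdict(list)  # bucket id -> list of files (each a list of keys)
    for rows in tasks:
        per_bucket = defaultdict(list)
        for key in rows:
            per_bucket[bucket_of(key)].append(key)
        for bucket, keys in per_bucket.items():
            # sortBy orders the rows inside each file only.
            files[bucket].append(sorted(keys))
    return dict(files)

# Two writing tasks, for example two partitions of the DataFrame being saved.
tasks = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]
layout = spark_style_layout(tasks)
files_per_bucket = {bucket: len(fs) for bucket, fs in layout.items()}
print(files_per_bucket)
```

In this simulation every bucket ends up with two files, one per writing task. With a single writing task, as in the question's `spark.range(100, numPartitions=1)`, each bucket would hold one file.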

Second, in Hive each bucket is globally sorted, which makes it possible to optimize queries that read the data.

In Spark, until this issue https://issues.apache.org/jira/browse/SPARK-19256 is (hopefully) resolved, each file is sorted individually, but the bucket as a whole is not globally sorted.

Therefore, since the sort is not global, there is no benefit from sortBy.
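To see why per-file ordering is not enough, here is a minimal pure-Python sketch (not Spark code; the file contents are made up for illustration). Two individually sorted files belong to the same bucket. Reading them one after the other does not produce sorted output, so a consumer that needs sorted input still has to sort or merge.

```python
import heapq

def is_sorted(xs):
    # True when every element is <= its successor.
    return all(a <= b for a, b in zip(xs, xs[1:]))

# Hypothetical contents of two files in the same bucket, each sorted by key.
file_a = [0, 9, 12]
file_b = [3, 6, 15]

# Reading the bucket file by file keeps per-file order only.
concatenated = file_a + file_b

# A globally sorted bucket needs an extra merge (or sort) step.
merged = list(heapq.merge(file_a, file_b))

print(is_sorted(file_a), is_sorted(file_b))  # each file on its own is sorted
print(is_sorted(concatenated))               # the bucket read as a whole is not
print(merged)
```

The merge step here plays the role of the `Sort` node that still shows up in the question's physical plan.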

I hope this answers your question.


apache-spark

apache-spark-sql

pyspark

pyspark-sql