Hive/Impala performance with string partition key vs Integer partition key

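Is it recommended to use numeric columns as partition keys? Is there any performance difference when we run a select query against numeric column partitions versus string column partitions?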

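No, there is no such recommendation. Consider this: a partition in Hive is represented as a folder with a name like 'key=value', or it can be just 'value', but either way it is a string folder name. So the value is stored as a string and is cast during read/write. The partition key value is not packed inside the data files and is not compressed.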
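To make the point above concrete, here is a minimal Python sketch of the idea, not Hive's actual code. The helper names `partition_dir` and `parse_partition_dir` are hypothetical, and the sketch ignores the escaping of special characters that real partition paths need. It only shows that an integer key and a string key end up as the same directory name, and that the declared type is applied by a cast when the name is read back.

```python
def partition_dir(key, value):
    """Render a 'key=value' directory name; the value always becomes text."""
    return f"{key}={value}"


def parse_partition_dir(name, declared_type):
    """Split a 'key=value' directory name and cast the text to the declared type."""
    key, _, raw_value = name.partition("=")
    return key, declared_type(raw_value)


# The same directory name results whether the key is declared numeric or string.
int_dir = partition_dir("year", 2024)
str_dir = partition_dir("year", "2024")
print(int_dir, str_dir)

# On read, the declared column type decides what the text is cast to.
print(parse_partition_dir(int_dir, int))
print(parse_partition_dir(str_dir, str))
```

In this sketch the type only matters at the moment of the cast, which is the answer's argument for why the on-disk layout does not depend on it.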

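Because of the distributed/parallel nature of map-reduce and Impala, you will not notice a difference in query processing performance in practice. In addition, all data is serialized to be passed between processing stages, then deserialized again and cast to some type, and this can happen many times for the same query.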

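Distributed processing and serializing/deserializing data create a lot of overhead. In practice only the size of the data matters: the smaller the table (the size of its files), the faster it runs. But you will not improve performance by restricting types.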

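Big string values used as partition keys can affect metadata database performance, and the number of partitions being processed can also affect performance. The same applies again: only the size of the data matters here, not the type.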

1, 0 can be better than 'Yes', 'No', but only because of size. In many cases compression and parallelism make this difference negligible.
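The size argument can be illustrated with a toy Python sketch. It assumes a column of 100,000 random flags stored as plain text and uses `zlib` purely as a stand-in for whatever codec a real table would use; it is not a model of Hive or Impala storage formats, and `encode_flags` is a hypothetical helper.

```python
import random
import zlib


def encode_flags(flags, true_text, false_text):
    """Concatenate the textual form of each boolean flag into one byte string."""
    return "".join(true_text if f else false_text for f in flags).encode("ascii")


# Fixed seed so the run is repeatable.
rng = random.Random(0)
flags = [rng.random() < 0.5 for _ in range(100_000)]

raw_numeric = encode_flags(flags, "1", "0")
raw_text = encode_flags(flags, "Yes", "No")

packed_numeric = zlib.compress(raw_numeric)
packed_text = zlib.compress(raw_text)

# Size difference between the two encodings, before and after compression.
raw_gap = len(raw_text) - len(raw_numeric)
packed_gap = abs(len(packed_text) - len(packed_numeric))

print("raw bytes:        ", len(raw_numeric), len(raw_text))
print("compressed bytes: ", len(packed_numeric), len(packed_text))
print("gap raw / packed: ", raw_gap, packed_gap)
```

Uncompressed, the 'Yes'/'No' encoding is larger simply because each value takes more bytes. After compression the absolute gap between the two encodings shrinks, which is the effect the answer refers to.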

Well, it does make a difference if you look at the official Impala documentation.

Instead of elaborating, I will paste the relevant section from the documentation, because I think it states it quite well:

"Although it might be convenient to use STRING columns for partition keys, even when those columns contain numbers, for performance and scalability it is much better to use numeric columns as partition keys whenever practical. Although the underlying HDFS directory name might be the same in either case, the in-memory storage for the partition key columns is more compact, and computations are faster, if partition key columns such as YEAR, MONTH, DAY and so on are declared as INT, SMALLINT, and so on."

Reference: https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string.html

hive

impala

apache-spark