Finding Hive/Impala table's compression details

Question

I am converting tables from one format to another, and from uncompressed to compressed (Snappy, Gzip, etc.).

I thought I could rely on describe [formatted|extended] tblname until I read this article: DESCRIBE Statement

It says:

The Compressed field is not a reliable indicator of whether the table contains compressed data. It typically always shows No, because the compression settings only apply during the session that loads data and are not stored persistently with the table metadata.

How can I find out whether a table is compressed and which codec was used? I don't mind using Spark to get that information.

Answer 1

The first thing that comes to mind is to check these Hive/MR properties:

hive.exec.compress.output=
mapreduce.output.fileoutputformat.compress=
mapreduce.output.fileoutputformat.compress.codec=   
mapreduce.output.fileoutputformat.compress.type=
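
As a rough sketch of how these settings could be inspected, the Python snippet below parses `key=value` lines and reports whether output compression was requested. It assumes the lines were captured from a Hive session beforehand (the capture step is not shown), and the helper names `parse_set_output` and `output_compression_enabled` are made up for illustration. Keep in mind that, per the documentation quoted in the question, these are session settings: they describe what a load in that session would write, not what the existing files contain.

```python
# Session properties named in the answer above.
COMPRESSION_KEYS = (
    "hive.exec.compress.output",
    "mapreduce.output.fileoutputformat.compress",
    "mapreduce.output.fileoutputformat.compress.codec",
    "mapreduce.output.fileoutputformat.compress.type",
)


def parse_set_output(text):
    """Parse 'key=value' lines into a dict, keeping only the compression keys."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key in COMPRESSION_KEYS:
            props[key] = value.strip()
    return props


def output_compression_enabled(props):
    """Return True if either on/off switch is set to 'true'.

    Assumption: treating either switch as "compression requested" is a
    simplification made for this sketch.
    """
    return any(
        props.get(key, "false").lower() == "true"
        for key in COMPRESSION_KEYS[:2]
    )
```

For example, passing the text `hive.exec.compress.output=true` to `parse_set_output` and the result to `output_compression_enabled` yields `True`; lines without an `=` sign are ignored.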

Answer 2

Answering my own question:

For Avro data files: avro-tools getmeta filename

For Parquet data files: parquet-tools meta filename
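
If avro-tools is not at hand, the codec can also be read straight from the file header. The sketch below is a minimal pure-Python reader, assuming the file follows the Avro object container layout (the 4-byte magic `Obj\x01` followed by a metadata map in which `avro.codec` names the codec, with a missing key meaning `null`, i.e. uncompressed). The function names are made up for illustration, and this is not how avro-tools itself is implemented.

```python
AVRO_MAGIC = b"Obj\x01"


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded variable-length integer)."""
    shift = 0
    result = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    # Undo zigzag encoding: 0, 1, 2, 3 -> 0, -1, 1, -2
    return (result >> 1) ^ -(result & 1)


def _read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_metadata(stream):
    """Return the header metadata map of an Avro container file as a dict."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = _read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            _read_long(stream)  # size is not needed here, skip it
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            meta[key] = _read_bytes(stream)
    return meta


def avro_codec(stream):
    """Return the codec name stored in the header, or 'null' if absent."""
    meta = read_avro_metadata(stream)
    return meta.get("avro.codec", b"null").decode("utf-8")
```

It would be used on a local copy of one of the table's data files opened in binary mode, for example `avro_codec(open("part-00000.avro", "rb"))`, where the file name is hypothetical. Parquet stores its codec per column chunk in the file footer, which this sketch does not attempt to parse.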

Answer 3

As you said, the 'describe formatted' and 'show create table' methods are not always guaranteed to contain the correct compression information.

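The most reliable way to identify the compression codec and storage format is to go to the HDFS location of the table's files and look at their extensions: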

hdfs dfs -ls -R /hdfspath/

For example, ORC files compressed with Snappy should end in .snappy.orc.
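
The extension check can be scripted. The following Python sketch guesses the storage format and codec from a file name; the suffix tables are illustrative assumptions rather than an official or exhaustive list, and `guess_from_name` is a made-up helper name.

```python
import posixpath

# Illustrative suffix tables (assumptions, not an exhaustive list).
FORMAT_SUFFIXES = {"orc": "ORC", "parquet": "Parquet", "avro": "Avro"}
CODEC_SUFFIXES = {
    "snappy": "Snappy",
    "gz": "Gzip",
    "bz2": "Bzip2",
    "lz4": "LZ4",
    "zlib": "Zlib",
    "deflate": "Deflate",
}


def guess_from_name(path):
    """Return (storage_format, codec) guessed from the dot-separated
    suffixes of a file name; each element is None when nothing matches."""
    suffixes = posixpath.basename(path).lower().split(".")[1:]
    fmt = next(
        (FORMAT_SUFFIXES[s] for s in suffixes if s in FORMAT_SUFFIXES), None
    )
    codec = next(
        (CODEC_SUFFIXES[s] for s in suffixes if s in CODEC_SUFFIXES), None
    )
    return fmt, codec
```

A name such as `/hdfspath/part-00000.snappy.orc` gives `("ORC", "Snappy")`. A name with no suffix at all gives `(None, None)`, in which case the file name alone cannot settle the question and the file metadata has to be inspected instead.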
