spark parquet enable dictionary
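I am running a Spark job that writes Parquet, and I want to enable dictionary encoding for the files it writes. When I inspect the files, I see that the columns are encoded as PLAIN_DICTIONARY. However, I do not see any statistics for these columns.
Please let me know if I am missing anything.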
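import org.apache.spark.sql.functions.sum
// Read one day of data, total the hits per (asn, country_code), and write the result as Parquet.
val ip = spark.read.parquet("/home/hadoop/work/cube/data/date=2020-02-01")
val ip1 = ip.groupBy("asn","country_code").agg(sum("total_hits").as("total_hits")).sort("asn")
ip1.write.parquet("/home/hadoop/work/cube/test_parquet_dictionary/att3")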
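For reference, a minimal sketch of the same job with dictionary encoding requested explicitly. As far as I know, parquet-mr enables dictionary encoding by default and Spark forwards writer options such as `parquet.enable.dictionary` to the Parquet writer, so the option below should only make the intent explicit. The object name, the application name, and the `att3_explicit` output path are hypothetical and not part of the original job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DictionaryWriteSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical application name; master and cluster settings come from spark-submit.
    val spark = SparkSession.builder().appName("dictionary-write-sketch").getOrCreate()

    // Same aggregation as in the question.
    val ip = spark.read.parquet("/home/hadoop/work/cube/data/date=2020-02-01")
    val ip1 = ip.groupBy("asn", "country_code")
      .agg(sum("total_hits").as("total_hits"))
      .sort("asn")

    // Assumption: the writer option is passed through to parquet-mr.
    // The output path is a made-up sibling of the path used in the question.
    ip1.write
      .option("parquet.enable.dictionary", "true")
      .parquet("/home/hadoop/work/cube/test_parquet_dictionary/att3_explicit")

    spark.stop()
  }
}
```

Even with the option set, the writer may still store a column as PLAIN, so the encodings reported for each column remain the thing to check.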
This is what I see in the meta output:
parquet-tools meta hdfs://<cluster>:50001//home/hadoop/work/cube/test_parquet_dictionary/att3/part-00190-522feb80-6fd7-4147-87f4-781b4e2c3599-c000.snappy.parquet
creator: parquet-mr version 1.5.0-cdh5.13.3 (build ${buildNumber})
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"asn","type":"string","nullable":true,"metadata":{}},{"name":"country_code","type":"string","nullable":true,"metadata":{}},{"name":"total_ [more]...
file schema: spark_schema
---------------------------------------------------------------------------------------------------------
asn: OPTIONAL BINARY O:UTF8 R:0 D:1
country_code: OPTIONAL BINARY O:UTF8 R:0 D:1
total_hits: OPTIONAL INT64 R:0 D:1
row group 1: RC:149 TS:2750
---------------------------------------------------------------------------------------------------------
asn: BINARY SNAPPY DO:0 FPO:4 SZ:665/1361/2.05 VC:149 ENC:BIT_PACKED,PLAIN,RLE
country_code: BINARY SNAPPY DO:0 FPO:669 SZ:300/331/1.10 VC:149 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
total_hits: INT64 SNAPPY DO:0 FPO:969 SZ:668/1058/1.58 VC:149 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
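To read the `ENC:` field in the lines above more mechanically, here is a small sketch in plain Scala (no Spark needed). The object and helper names are hypothetical; the code only assumes that each column line contains an `ENC:` token followed by a comma-separated list of encodings, as in the output shown here.

```scala
object EncodingCheck {
  // Returns the encodings listed after "ENC:" in one column line of
  // `parquet-tools meta` output, or an empty set if the marker is absent.
  def encodings(metaLine: String): Set[String] = {
    val marker = "ENC:"
    val idx = metaLine.indexOf(marker)
    if (idx < 0) Set.empty[String]
    else {
      // Keep only the token right after "ENC:"; other fields may follow it.
      val token = metaLine.substring(idx + marker.length).trim.split("\\s+").head
      token.split(",").filter(_.nonEmpty).toSet
    }
  }

  // True if any listed encoding is a dictionary encoding
  // (for example PLAIN_DICTIONARY or RLE_DICTIONARY).
  def usesDictionary(metaLine: String): Boolean =
    encodings(metaLine).exists(_.endsWith("DICTIONARY"))

  def main(args: Array[String]): Unit = {
    // The three column lines copied from the meta output in the question.
    val lines = Seq(
      "asn: BINARY SNAPPY DO:0 FPO:4 SZ:665/1361/2.05 VC:149 ENC:BIT_PACKED,PLAIN,RLE",
      "country_code: BINARY SNAPPY DO:0 FPO:669 SZ:300/331/1.10 VC:149 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY",
      "total_hits: INT64 SNAPPY DO:0 FPO:969 SZ:668/1058/1.58 VC:149 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY"
    )
    for (line <- lines) {
      val column = line.takeWhile(_ != ':')
      println(s"$column dictionary=${usesDictionary(line)}")
    }
  }
}
```

Run against the three lines above, this reports `dictionary=true` for `country_code` and `total_hits` and `dictionary=false` for `asn`, whose values are listed as PLAIN.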
Here is the footer information as well:
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
asn: BINARY SNAPPY DO:0 FPO:4 SZ:665/1361/2.05 VC:149 ENC:RLE,PLAIN,BIT_PACKED
country_code: BINARY SNAPPY DO:0 FPO:669 SZ:300/331/1.10 VC:149 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
total_hits: INT64 SNAPPY DO:0 FPO:969 SZ:668/1058/1.58 VC:149 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
asn TV=149 RL=0 DL=1
-----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN SZ:1324 VC:149
country_code TV=149 RL=0 DL=1 DS: 30 DE:PLAIN_DICTIONARY
---------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:104 VC:149
total_hits TV=149 RL=0 DL=1 DS: 107 DE:PLAIN_DICTIONARY
---------------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY SZ:142 VC:149
Spark version ==>
sc.version
res10: String = 2.4.0.cloudera2
Got the answer. The parquet-tools version I was using was 1.6. Upgrading to 1.10 solved the problem.