Hive export to AVRO not having column names in the schema

Question

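I am trying to create a table in Hive and export it in Avro format.

Eventually I want to load this Avro file into Google BigQuery. For some reason, the Avro schema does not have the correct column names after the export.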

create table if not exists test (id int, name varchar(40));
insert into test values (1, "AK");
insert overwrite directory "/tmp/test" stored as avro select * from test;
!sh hadoop fs -cat /tmp/test/*;

The column names in the output should be id and name, but they are written as _col0 and _col1.

Objavro.schema▒{"type":"record","name":"baseRecord","fields":[{"name":"_col0","type":["null","int"],"default":null},{"name":"_col1","type":["null",{"type":"string","logicalType":"varchar","maxLength":40}],"default":null}]}▒Bh▒▒δ*@▒x~AK▒Bh▒▒δ*@▒x~
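
The raw dump above mixes the JSON schema with binary data, which makes the field names hard to check by eye. As an illustration only (not part of the original thread), here is a minimal Python sketch that reads the header of an Avro object container file and lists the field names of the embedded schema. It uses only the standard library and assumes the file has been copied to the local file system first; the helper names (read_long, read_bytes, read_header_metadata, field_names) are made up for this example.

```python
import json

MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        value = byte[0]
        accum |= (value & 0x7F) << shift
        if not value & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (also used for Avro strings)."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_header_metadata(stream):
    """Return the file metadata map (str -> bytes) from the header."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:  # an empty block terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return metadata


def field_names(stream):
    """List the top-level field names of the schema stored in the header."""
    schema = json.loads(read_header_metadata(stream)["avro.schema"])
    return [field["name"] for field in schema["fields"]]
```

To use it, fetch one of the exported files (for example with hadoop fs -get), open the local copy in binary mode, and pass the file object to field_names. For the export shown in the question this should list _col0 and _col1 rather than id and name.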

Thanks,

AK

Answer 1

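This appears to be the expected behavior when exporting with the insert overwrite directory clause. This older thread is about the same issue. It is quite old, but I believe its conclusion still holds (at least, I could not find a direct way to preserve the column names). It does include some workarounds for the problem, so it may be worth a read.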

Answer 2

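If you need to export the Avro binaries into a single file for further ingestion (into BigQuery, in my case), do not use hadoop cat or insert overwrite statements. Use avro-tools and concatenate the files into one large Avro file.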

hadoop jar avro-tools-1.8.2.jar concat /tmp/test_avro/* big_avro_table.avro

hive

avro