Hive query logic and optimization

I have data in the following format:

Input:

**ID     col1     Rank**
ID1      C1_abc      R1_1
ID1      C1_xce      R1_2
ID1      C1_fde      R1_3
ID1      C1_sde      R1_4
ID2      C1_sds      R1_1
ID2      C1_hhh      R1_2
ID3      C1_aaa      R1_1
ID4      C1_asw      R1_1
ID4      C1_eee      R1_2
ID4      C1_ttt      R1_3

Output:

**ID    col1    col2      col3**
1     C1_abc     C1_xce    C1_fde      
2     C1_sds     C1_hhh    null
3     C1_aaa     null      null
4     C1_asw     C1_eee    C1_ttt

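I want to achieve this with a Hive script. I know several ways to do it, but because the data volume is very large, I need the most efficient one.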

Just use conditional aggregation:

-- pivot the first three ranks of each id into columns
select id,
       max(case when rank = 1 then col1 end) as col1,
       max(case when rank = 2 then col1 end) as col2,
       max(case when rank = 3 then col1 end) as col3
from t
where rank in (1, 2, 3)
group by id;
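
The sample input shows rank values such as R1_1, while the query above compares rank with integers. If the column really holds those strings (an assumption based only on the sample rows), a minimal sketch of the same aggregation would compare against the string literals instead:

```sql
-- Assumption: t.rank is a string column holding 'R1_1', 'R1_2', ...
-- exactly as in the sample input; adjust the literals if it does not.
select id,
       max(case when rank = 'R1_1' then col1 end) as col1,
       max(case when rank = 'R1_2' then col1 end) as col2,
       max(case when rank = 'R1_3' then col1 end) as col3
from t
where rank in ('R1_1', 'R1_2', 'R1_3')
group by id;
```

An id with fewer than three ranks gets null in the missing columns, because a case expression without an else branch yields null and max ignores nulls.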

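An alternative is a multi-way self-join:

-- the rank = 1 filter on t1 belongs in WHERE: inside a LEFT JOIN's ON
-- clause it would not remove any t1 rows
select t1.id, t1.col1, t2.col1 as col2, t3.col1 as col3
from t t1 left join
     t t2
     on t1.id = t2.id and t2.rank = 2 left join
     t t3
     on t1.id = t3.id and t3.rank = 3
where t1.rank = 1;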

You will probably need to try both and see which one runs faster. The result can vary with your data.
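
One inexpensive way to compare the two before running them on the full data is to look at their plans with Hive's EXPLAIN statement. The sketch below assumes the same table t and integer rank values as the queries above. What the plans look like, and which query wins, will depend on your Hive version, execution engine, and data.

```sql
-- Plan for the conditional aggregation
explain
select id,
       max(case when rank = 1 then col1 end) as col1,
       max(case when rank = 2 then col1 end) as col2,
       max(case when rank = 3 then col1 end) as col3
from t
where rank in (1, 2, 3)
group by id;

-- Plan for the multi-way self-join
explain
select t1.id, t1.col1, t2.col1 as col2, t3.col1 as col3
from t t1 left join
     t t2
     on t1.id = t2.id and t2.rank = 2 left join
     t t3
     on t1.id = t3.id and t3.rank = 3
where t1.rank = 1;
```

The aggregation references t once and the join references it three times, so the number of table scans and shuffle stages in each plan is a reasonable first thing to compare. Actual run time on a representative sample is still the deciding measurement.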