HiveQL - 将多级小计加入现有 table
HiveQL - join multi-level subtotals to existing table
我的目标是确定各个级别的各个组织的规模。假设我们有三个组织 'A'、'B' 和 'C',每个组织都由多个部门组成,并且在团队中有进一步细分的成员。如下所示:
Org. Dep. Tm. Member
A 1 I name1
A 1 I name2
A 1 I name3
A 1 II name4
A 2 I name5
A 2 I name6
B 1 I name7
B 1 II name8
B 1 II name9
B 1 II name10
B 2 I name11
B 2 I name12
B 2 II name13
B 2 II name14
B 2 III name15
B 2 III name16
C 1 I name17
C 1 I name18
C 1 I name19
C 1 I name20
C 1 I name21
现在,我想知道每个成员各自的组织、部门有多大。和TM。是这样的:
Org. Dep. Tm. Member org dep tm
A 1 I name1 6 4 3
A 1 I name2 6 4 3
A 1 I name3 6 4 3
A 1 II name4 6 4 1
A 2 I name5 6 2 2
A 2 I name6 6 2 2
B 1 I name7 10 4 1
B 1 II name8 10 4 3
B 1 II name9 10 4 3
B 1 II name10 10 4 3
B 2 I name11 10 6 2
B 2 I name12 10 6 2
B 2 II name13 10 6 2
B 2 II name14 10 6 2
B 2 III name15 10 6 2
B 2 III name16 10 6 2
C 1 I name17 5 5 5
C 1 I name18 5 5 5
C 1 I name19 5 5 5
C 1 I name20 5 5 5
C 1 I name21 5 5 5
我最初的想法是通过多个 LEFT JOINS 来聚合不同的级别,但是这种扩展性很差,因为您需要为每个聚合级别创建一个新的联接。有没有办法在一条语句中高效地做到这一点?
使用window个函数:
select org, dep, tm,
count(*) over (partition by org) as org_cnt,
count(*) over (partition by org, dep) as dep_cnt,
count(*) over (partition by org, dep, tm) as tm_cnt
from t;
列是分层的,因此 dep
和 tm
需要更高级别的层次结构。
编辑:
如果 Hive 不支持 count(distinct)
而你需要它,那么你可以这样做:
select org, dep, tm,
sum(case when seqnum_o = 1 then 1 else 0 end) over (partition by org) as org_cnt,
sum(case when seqnum_od = 1 then 1 else 0 end) over (partition by org, dep) as dep_cnt,
sum(case when seqnum_odt = 1 then 1 else 0 end) over (partition by org, dep, tm) as tm_cnt
from (select t.*,
row_number() over partition by org, memberid order by org) as seqnum_o,
row_number() over partition by org, dep, memberid order by org) as seqnum_od,
row_number() over partition by org, dep, tm, memberid order by org) as seqnum_odt
from t
) t;
我的目标是确定各个级别的各个组织的规模。假设我们有三个组织 'A'、'B' 和 'C',每个组织都由多个部门组成,并且在团队中有进一步细分的成员。如下所示:
Org. Dep. Tm. Member
A 1 I name1
A 1 I name2
A 1 I name3
A 1 II name4
A 2 I name5
A 2 I name6
B 1 I name7
B 1 II name8
B 1 II name9
B 1 II name10
B 2 I name11
B 2 I name12
B 2 II name13
B 2 II name14
B 2 III name15
B 2 III name16
C 1 I name17
C 1 I name18
C 1 I name19
C 1 I name20
C 1 I name21
现在,我想知道每个成员各自的组织、部门有多大。和TM。是这样的:
Org. Dep. Tm. Member org dep tm
A 1 I name1 6 4 3
A 1 I name2 6 4 3
A 1 I name3 6 4 3
A 1 II name4 6 4 1
A 2 I name5 6 2 2
A 2 I name6 6 2 2
B 1 I name7 10 4 1
B 1 II name8 10 4 3
B 1 II name9 10 4 3
B 1 II name10 10 4 3
B 2 I name11 10 6 2
B 2 I name12 10 6 2
B 2 II name13 10 6 2
B 2 II name14 10 6 2
B 2 III name15 10 6 2
B 2 III name16 10 6 2
C 1 I name17 5 5 5
C 1 I name18 5 5 5
C 1 I name19 5 5 5
C 1 I name20 5 5 5
C 1 I name21 5 5 5
我最初的想法是通过多个 LEFT JOINS 来聚合不同的级别,但是这种扩展性很差,因为您需要为每个聚合级别创建一个新的联接。有没有办法在一条语句中高效地做到这一点?
使用window个函数:
select org, dep, tm,
count(*) over (partition by org) as org_cnt,
count(*) over (partition by org, dep) as dep_cnt,
count(*) over (partition by org, dep, tm) as tm_cnt
from t;
列是分层的,因此 dep
和 tm
需要更高级别的层次结构。
编辑:
如果 Hive 不支持 count(distinct)
而你需要它,那么你可以这样做:
select org, dep, tm,
sum(case when seqnum_o = 1 then 1 else 0 end) over (partition by org) as org_cnt,
sum(case when seqnum_od = 1 then 1 else 0 end) over (partition by org, dep) as dep_cnt,
sum(case when seqnum_odt = 1 then 1 else 0 end) over (partition by org, dep, tm) as tm_cnt
from (select t.*,
row_number() over partition by org, memberid order by org) as seqnum_o,
row_number() over partition by org, dep, memberid order by org) as seqnum_od,
row_number() over partition by org, dep, tm, memberid order by org) as seqnum_odt
from t
) t;