HiveQL - join multi-level subtotals to existing table

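My goal is to determine the size of each organization at every level. Suppose we have three organizations, 'A', 'B' and 'C'. Each consists of several departments, and the members of each department are further subdivided into teams, as shown below: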

Org.    Dep.    Tm. Member
A       1       I   name1
A       1       I   name2
A       1       I   name3
A       1       II  name4
A       2       I   name5
A       2       I   name6
B       1       I   name7
B       1       II  name8
B       1       II  name9
B       1       II  name10
B       2       I   name11
B       2       I   name12
B       2       II  name13
B       2       II  name14
B       2       III name15
B       2       III name16
C       1       I   name17
C       1       I   name18
C       1       I   name19
C       1       I   name20
C       1       I   name21

Now, for each member, I want to know how large their organization, department and team are, like this:

Org.    Dep.    Tm. Member  org dep tm
A       1       I   name1   6   4   3
A       1       I   name2   6   4   3
A       1       I   name3   6   4   3
A       1       II  name4   6   4   1
A       2       I   name5   6   2   2
A       2       I   name6   6   2   2
B       1       I   name7   10  4   1
B       1       II  name8   10  4   3
B       1       II  name9   10  4   3
B       1       II  name10  10  4   3
B       2       I   name11  10  6   2
B       2       I   name12  10  6   2
B       2       II  name13  10  6   2
B       2       II  name14  10  6   2
B       2       III name15  10  6   2
B       2       III name16  10  6   2
C       1       I   name17  5   5   5
C       1       I   name18  5   5   5
C       1       I   name19  5   5   5
C       1       I   name20  5   5   5
C       1       I   name21  5   5   5

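My initial idea was to aggregate the different levels with multiple LEFT JOINs, but that scales poorly because every aggregation level needs its own join. Is there a way to do this efficiently in a single statement?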
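For reference, the join-based approach described above might look roughly like the sketch below. The table name t and the column names org, dep, tm and member are assumptions based on the sample data, not names given in the question.

```sql
-- Sketch only: one aggregated subquery and one LEFT JOIN per level.
-- Table t and columns org, dep, tm, member are assumed names.
select t.org, t.dep, t.tm, t.member,
       o.org_cnt, d.dep_cnt, m.tm_cnt
from t
left join (select org, count(*) as org_cnt
           from t
           group by org) o
  on t.org = o.org
left join (select org, dep, count(*) as dep_cnt
           from t
           group by org, dep) d
  on t.org = d.org and t.dep = d.dep
left join (select org, dep, tm, count(*) as tm_cnt
           from t
           group by org, dep, tm) m
  on t.org = m.org and t.dep = m.dep and t.tm = m.tm;
```

Each additional level of the hierarchy would need another subquery and another join, which is the scaling problem in question.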

Use window functions:

select org, dep, tm,
       count(*) over (partition by org) as org_cnt,
       count(*) over (partition by org, dep) as dep_cnt,
       count(*) over (partition by org, dep, tm) as tm_cnt
from t;

The columns are hierarchical, so the partitions for dep and tm must also include the higher levels of the hierarchy.
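To see why the higher levels matter, here is a small sketch against the same table t. It assumes the member column is simply named member, as in the sample data. Partitioning by dep alone would lump together department 1 of A, B and C, because department numbers repeat across organizations.

```sql
-- Sketch only: compares a partition on dep alone with the hierarchical one.
-- The column name member is an assumption.
select org, dep, tm, member,
       count(*) over (partition by dep)      as dep_only_cnt, -- mixes organizations
       count(*) over (partition by org, dep) as dep_cnt       -- stays within one organization
from t;
```

With the sample data, name1 would get dep_only_cnt = 13 (4 + 4 + 5 members in department 1 of A, B and C) but dep_cnt = 4.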

Edit:

If your version of Hive does not support count(distinct) as a window function and you need it, you can do this instead:

-- Count only the first row of each member per level, which emulates count(distinct memberid).
select org, dep, tm,
       sum(case when seqnum_o = 1 then 1 else 0 end) over (partition by org) as org_cnt,
       sum(case when seqnum_od = 1 then 1 else 0 end) over (partition by org, dep) as dep_cnt,
       sum(case when seqnum_odt = 1 then 1 else 0 end) over (partition by org, dep, tm) as tm_cnt
from (select t.*,
             -- Number the rows of each member within each level; row 1 marks a distinct member.
             row_number() over (partition by org, memberid order by org) as seqnum_o,
             row_number() over (partition by org, dep, memberid order by org) as seqnum_od,
             row_number() over (partition by org, dep, tm, memberid order by org) as seqnum_odt
      from t
     ) t;
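For comparison, on Hive versions that accept DISTINCT inside windowed aggregates (the Hive documentation lists this for 2.1.0 and later, as far as I know only with a plain partition by and no order by), the direct form might look like the sketch below. Check it against your own version before relying on it.

```sql
-- Sketch only: assumes a Hive version that allows count(distinct ...) over (...).
-- memberid is the member identifier column used in the query above.
select org, dep, tm,
       count(distinct memberid) over (partition by org)          as org_cnt,
       count(distinct memberid) over (partition by org, dep)     as dep_cnt,
       count(distinct memberid) over (partition by org, dep, tm) as tm_cnt
from t;
```

If this is rejected by your Hive version, the row_number() workaround above should give the same distinct counts.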