Selecting the highest count for a categorical variable when grouping
I have the following table:
custID Cat
1 A
1 B
1 B
1 B
1 C
2 A
2 A
2 C
3 B
3 C
4 A
4 C
4 C
4 C
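I need the most efficient way to aggregate by custID so that, for each customer, I get the most frequent category (Cat), the second most frequent category, and the third most frequent category. The output for the table above should be: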
custID most_freq 2nd_most_freq 3rd_most_freq
1 B A C
2 A C Null
3 B C Null
4 C A Null
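When counts are tied, I don't really care which one comes first and which comes second. For example, for customer 1 the second and third most frequent categories could be swapped, because each of them occurs only once.
Any SQL will do, preferably Hive SQL.
Thanks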
Try using group by twice together with dense_rank(), ordered by the count of each cat. I'm not 100% sure, but I think it should work in Hive as well.
-- outer query: pivot the ranked categories into one row per customer
select custId,
       max(case when t.rn = 1 then cat end) as [most freq],
       max(case when t.rn = 2 then cat end) as [2nd most freq],
       max(case when t.rn = 3 then cat end) as [3rd most freq]
from
(
    -- inner query: count rows per (custId, cat) and rank the counts within each customer
    select custId, cat,
           dense_rank() over (partition by custId order by count(*) desc) rn
    from your_table
    group by custId, cat
) t
group by custId
Following the comments, I have added a slightly modified version of the solution that is compatible with Hive SQL:
select custId,
       max(case when t.rn = 1 then cat else null end) as most_freq,
       max(case when t.rn = 2 then cat else null end) as 2nd_most_freq,
       max(case when t.rn = 3 then cat else null end) as 3rd_most_freq
from
(
    -- rank the precomputed counts within each customer
    select custId, cat,
           dense_rank() over (partition by custId order by ct desc) rn
    from (
        -- count rows per (custId, cat) first, in a separate subquery
        select custId, cat, count(*) ct
        from your_table
        group by custId, cat
    ) your_table_with_counts
) t
group by custId
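One thing to watch with dense_rank(): categories with the same count share a rank, so max() keeps only one of them and the next slot stays empty. On the sample data, customer 1 would come out as B, C, NULL rather than B, A, C, and customer 3 as C, NULL, NULL. If you want every tied category to get its own slot, a possible variant is to use row_number() with an explicit tie-break. This is only a sketch: the alias names and the choice to break ties alphabetically on cat are my assumptions, not part of the original answer.

```sql
-- Sketch: same shape as the Hive query above, but row_number() gives each
-- tied category a distinct position instead of a shared rank.
-- your_table, custId and cat are the names used in the question; the output
-- aliases and the tie-break on cat are assumptions made for illustration.
select custId,
       max(case when t.rn = 1 then cat end) as most_freq,
       max(case when t.rn = 2 then cat end) as second_most_freq,
       max(case when t.rn = 3 then cat end) as third_most_freq
from (
    -- number the categories per customer: highest count first, then by cat
    select custId, cat,
           row_number() over (partition by custId order by ct desc, cat) as rn
    from (
        -- count rows per (custId, cat)
        select custId, cat, count(*) as ct
        from your_table
        group by custId, cat
    ) cat_counts
) t
group by custId;
```

With the sample table this should return B, A, C for customer 1, A, C, NULL for customer 2, B, C, NULL for customer 3 and C, A, NULL for customer 4, which matches the output asked for in the question.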
SELECT journal, count(*) as frequency
FROM ${hiveconf:TNHIVE}
WHERE journal IS NOT NULL
GROUP BY journal
ORDER BY frequency DESC
LIMIT 5;