Hive query optimization
My requirement is to get the id and name of the students who have more than 1 email id and type = 1.
I am using a query like this:
select distinct b.id, b.name, b.email, b.type,a.cnt
from (
select id, count(email) as cnt
from (
select distinct id, email
from table1
) c
group by id
) a
join table1 b on a.id = b.id
where b.type=1
order by b.id
Please let me know whether this is fine or whether a simpler version is available.
Sample data is like:
id name email type
123 AAA abc@xyz.com 1
123 AAA acd@xyz.com 1
123 AAA ayx@xyz.com 3
345 BBB nch@xyz.com 1
345 BBB nch@xyz.com 1
678 CCC iuy@xyz.com 1
Expected Output:
123 AAA abc@xyz.com 1 2
123 AAA acd@xyz.com 1 2
345 BBB nch@xyz.com 1 1
678 CCC iuy@xyz.com 1 1
You can use group by -> having count() for this requirement.
select distinct b.id
, b.name
, b.email
, b.type
from table1 b
where b.id in
-- ids that have more than one distinct email
(select id from table1 group by id having count(distinct email) > 1)
and b.type=1
order by b.id
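As a quick sanity check of the stated requirement (ids with more than one distinct email, restricted to type = 1 rows), here is a minimal Python sketch over the sample data from the question. It only emulates the group by / having idea in memory; the function names are hypothetical and nothing here is Hive API.

```python
from collections import defaultdict

# Sample rows from the question: (id, name, email, type)
rows = [
    (123, "AAA", "abc@xyz.com", 1),
    (123, "AAA", "acd@xyz.com", 1),
    (123, "AAA", "ayx@xyz.com", 3),
    (345, "BBB", "nch@xyz.com", 1),
    (345, "BBB", "nch@xyz.com", 1),
    (678, "CCC", "iuy@xyz.com", 1),
]


def ids_with_multiple_emails(rows):
    """Emulate: select id from t group by id having count(distinct email) > 1."""
    emails = defaultdict(set)
    for id_, _name, email, _type in rows:
        emails[id_].add(email)
    return {id_ for id_, seen in emails.items() if len(seen) > 1}


def filter_students(rows):
    """Emulate the outer query: distinct type = 1 rows whose id passed the subquery."""
    keep = ids_with_multiple_emails(rows)
    return sorted({r for r in rows if r[0] in keep and r[3] == 1})


result = filter_students(rows)
for row in result:
    print(row)
```

On the sample data only id 123 has more than one distinct email, so only its two type = 1 rows remain. Note that the distinct emails are counted across all types here; count only type = 1 rows instead if that is what you need.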
You can try the analytic (window) form of the count() function:
SELECT sub.ID, sub.NAME
FROM (SELECT ID, NAME, TYPE,
-- cnt = number of rows sharing this (ID, EMAIL) pair
COUNT(*) OVER (PARTITION BY ID, EMAIL) cnt
FROM raw.crddacia_raw) sub
WHERE sub.cnt > 1 AND sub.TYPE = 1
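I would strongly recommend using window functions. However, older Hive versions (before 2.1.0) do not support count(distinct) as a window function. There are different ways to work around this. One is the sum of two dense_rank()s, minus 1: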
select id, name, email, type, cnt
from (select t1.*,
-- ascending rank + descending rank - 1 = number of distinct emails for the id
(dense_rank() over (partition by id order by email) +
dense_rank() over (partition by id order by email desc) - 1
) as cnt
from table1 t1
) t
where type = 1;
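I would expect this to perform better than your version. However, it is worth testing the different versions to see which one is actually faster (and feel free to come back and let others know which is better).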
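To see why the dense_rank() trick counts distinct values, here is a small Python sketch, assuming no NULL emails. If a value has k distinct values at or below it and m at or above it, then k + m counts the value itself twice, so k + m - 1 is the number of distinct values in the partition. The helper names are hypothetical and only emulate what the window functions compute.

```python
def dense_rank(values, reverse=False):
    """Map each value to its 1-based dense rank among the distinct values."""
    ordered = sorted(set(values), reverse=reverse)
    return {v: i + 1 for i, v in enumerate(ordered)}


def distinct_count_via_dense_rank(values):
    """For every element: ascending dense rank + descending dense rank - 1."""
    asc = dense_rank(values)
    desc = dense_rank(values, reverse=True)
    return [asc[v] + desc[v] - 1 for v in values]


# One "partition by id" at a time, using the sample data from the question.
emails_123 = ["abc@xyz.com", "acd@xyz.com", "ayx@xyz.com"]
emails_345 = ["nch@xyz.com", "nch@xyz.com"]

counts_123 = distinct_count_via_dense_rank(emails_123)
counts_345 = distinct_count_via_dense_rank(emails_345)
print(counts_123, counts_345)
```

Every row in a partition gets the same value, and that value equals the number of distinct emails. In the query the window is computed before the outer type = 1 filter, so id 123 gets 3 because the type 3 email is included; filter on type inside the subquery if only type 1 emails should count.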
Another approach uses collect_set and takes the size of the returned array to count the distinct emails.
Demo:
--your data example
with table1 as ( --use your table instead of this
select stack(6,
123, 'AAA', 'abc@xyz.com', 1,
123, 'AAA', 'acd@xyz.com', 1,
123, 'AAA', 'ayx@xyz.com', 3,
345, 'BBB', 'nch@xyz.com', 1,
345, 'BBB', 'nch@xyz.com', 1,
678, 'CCC', 'iuy@xyz.com', 1
) as (id, name, email, type )
)
--query
select distinct id, name, email, type,
size(collect_set(email) over(partition by id)) cnt
from table1
where type=1
Result:
id name email type cnt
123 AAA abc@xyz.com 1 2
123 AAA acd@xyz.com 1 2
345 BBB nch@xyz.com 1 1
678 CCC iuy@xyz.com 1 1
We still need DISTINCT here because the analytic function does not remove duplicate rows such as 345 BBB nch@xyz.com.
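The same logic can be traced outside Hive with a short Python sketch, which may help show why DISTINCT is still required. It filters on type first (as the where clause does), attaches the per-id set size to every row (as the window function does), and only then removes duplicate rows. The function name is hypothetical.

```python
from collections import defaultdict

# Sample rows from the question: (id, name, email, type)
rows = [
    (123, "AAA", "abc@xyz.com", 1),
    (123, "AAA", "acd@xyz.com", 1),
    (123, "AAA", "ayx@xyz.com", 3),
    (345, "BBB", "nch@xyz.com", 1),
    (345, "BBB", "nch@xyz.com", 1),
    (678, "CCC", "iuy@xyz.com", 1),
]


def with_distinct_email_count(rows, wanted_type=1):
    """Emulate size(collect_set(email) over (partition by id)) after where type = 1."""
    filtered = [r for r in rows if r[3] == wanted_type]
    email_sets = defaultdict(set)
    for id_, _name, email, _type in filtered:
        email_sets[id_].add(email)
    # The window function keeps every input row, duplicates included.
    return [r + (len(email_sets[r[0]]),) for r in filtered]


with_duplicates = with_distinct_email_count(rows)
deduplicated = sorted(set(with_duplicates))  # the effect of SELECT DISTINCT

for row in deduplicated:
    print(row)
```

Before deduplication the 345 row appears twice; after it, the four rows match the result shown above.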
This is very similar to your query, but here I filter the data in the initial step (in the inner query), so the join operates on less data.
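select distinct orig_table.id, orig_table.name, orig_table.email, orig_table.type, intr_table.cnt
from table1 orig_table
join (
-- filter to type = 1 before joining, then count the distinct emails per id
select a.id, a.type, count(distinct a.email) as cnt
from table1 a where a.type = 1 group by a.id, a.type
) intr_table
on intr_table.id = orig_table.id and intr_table.type = orig_table.type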