How do I calculate the most popular device, os, browser in each city in HiveQL?
I have a table containing user agent strings (which I parse into browser, os and device columns) and city ids. I want to calculate the most popular browser, os and device for each city.
Here is my attempt:
select device, os, browser, name, MAX(hits) as pop from
(select uap.device, uap.os, uap.browser, name, COUNT(*) as hits
from (select * from browserdata join citydata on cityid=id) t
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser
GROUP BY uap.device, uap.os, uap.browser, name) t2
GROUP BY name;
So the innermost subquery, aliased t, just joins my table to another table that maps ids to city names, so that I can see the actual name in the output rather than the city id.
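Then the subquery named t2 counts the occurrences of each composite key (device, browser, os, city). The outer query then groups everything by name and is meant to extract, for each city, the row with the maximum number of users.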
The error I get is this:
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'device'
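I understand what it means: it says I need to include device in the GROUP BY, but if I do that, the query no longer calculates what I want. How do I fix my query?
Also, I have noticed that some of my Hive queries run on MapReduce but do not run on Tez. Why is that?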
WITH t1 as
(select * from browserdata join citydata on cityid=id),
t2 as
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname
from t1
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),
t3 as
(SELECT t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, COUNT(*) as count FROM t2 GROUP BY t2.cityname, t2.os, t2.device, t2.browser),
t4 as
(select cityname, MAX(count) as maximum from t3 group by cityname)
select t4.cityname, t4.maximum, t3.device, t3.os, t3.browser
from t4 join t3 on t4.cityname=t3.cityname and t4.maximum=t3.count;
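This works, but I would like to know whether there is a way to optimize it...
Using analytic functions, you can eliminate the unnecessary join: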
WITH
t1 as
(select * from browserdata join citydata on cityid=id),
t2 as
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname
from t1
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),
t3 as
(select t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, count(*) as count from t2 group by t2.cityname, t2.os, t2.device, t2.browser)
select cityname, maximum, device, os, browser
from
(select cityname, device, browser, os,
max(count) over(partition by cityname) as maximum,
dense_rank() over (partition by cityname order by count desc ) as rnk
from t3
) s where rnk =1
;
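To check that the window-function version returns the same rows as the MAX-plus-join version, here is a minimal sketch that runs both against a tiny in-memory table. It uses Python's built-in sqlite3 rather than Hive (this assumes an SQLite build of 3.25 or later, which is where window functions were added), the data is made up, and the count column is named hits here instead of count; the table simply stands in for the pre-aggregated t3 CTE.

```python
import sqlite3

# Hypothetical pre-aggregated counts standing in for the t3 CTE above.
rows = [
    ("Berlin", "iPhone", "iOS", "Safari", 5),
    ("Berlin", "PC", "Windows", "Chrome", 3),
    ("Paris", "PC", "Windows", "Chrome", 4),
    ("Paris", "Mac", "macOS", "Safari", 4),  # ties with the row above
    ("Paris", "iPhone", "iOS", "Safari", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t3 (cityname TEXT, device TEXT, os TEXT, browser TEXT, hits INTEGER)"
)
conn.executemany("INSERT INTO t3 VALUES (?, ?, ?, ?, ?)", rows)

# Variant 1: per-city maximum, joined back to t3 to recover the other columns.
join_sql = """
SELECT t3.cityname, t4.maximum, t3.device, t3.os, t3.browser
FROM (SELECT cityname, MAX(hits) AS maximum FROM t3 GROUP BY cityname) t4
JOIN t3 ON t4.cityname = t3.cityname AND t4.maximum = t3.hits
ORDER BY t3.cityname, t3.device
"""

# Variant 2: rank rows inside each city and keep rank 1, with no join.
window_sql = """
SELECT cityname, maximum, device, os, browser
FROM (SELECT cityname, device, os, browser,
             MAX(hits) OVER (PARTITION BY cityname) AS maximum,
             DENSE_RANK() OVER (PARTITION BY cityname ORDER BY hits DESC) AS rnk
      FROM t3) s
WHERE rnk = 1
ORDER BY cityname, device
"""

join_result = conn.execute(join_sql).fetchall()
window_result = conn.execute(window_sql).fetchall()

for row in window_result:
    print(row)
```

Both variants keep every row that ties for first place, so Paris appears twice in this sample. If exactly one row per city is wanted, row_number() can be used in place of dense_rank(); it then picks one of the tied rows arbitrarily unless the ORDER BY includes a tie-breaker.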