How do I calculate the most popular device, OS, and browser in each city in HiveQL?

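I have a table containing user agent strings (which I parse into browser, os, and device columns) and city ids. I want to calculate the most popular browser/os/device combination for each city.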

Here is my attempt:

select device, os, browser, name, MAX(hits) as pop from 
(select uap.device, uap.os, uap.browser, name, COUNT(*) as hits 
from (select * from browserdata join citydata on cityid=id) t 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser 
GROUP BY uap.device, uap.os, uap.browser, name) t2 
GROUP BY name;

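The innermost subquery, aliased t, simply joins my table to another table that maps ids to city names, so that the output shows the actual city name rather than the city id.

The subquery named t2 then counts the rows for each composite key (device, browser, os, city). The outer query groups everything by name and is meant to extract the row with the largest number of hits.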

The error I get is this:

FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'device'

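I understand what it means: I need to include device in the GROUP BY. But if I do that, the query no longer calculates what I want. How can I fix my query?

Also, I have noticed that some of my Hive queries run on MapReduce but do not run on Tez. Why is that?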
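To make the error concrete, here is a minimal sketch of the same pattern on a smaller shape. It assumes a hypothetical pre-aggregated table city_hits(name, device, hits), which is not part of the original schema. Hive requires every selected column to be either aggregated or listed in the GROUP BY, and MAX(hits) on its own does not carry along the other columns of the row that produced the maximum.

```sql
-- Hypothetical table (an assumption for illustration only):
--   city_hits(name STRING, device STRING, hits BIGINT)

-- Rejected with Error 10025: device is neither aggregated nor a GROUP BY key,
-- so Hive cannot decide which device value to return for each name.
-- SELECT device, name, MAX(hits) AS pop FROM city_hits GROUP BY name;

-- Accepted, but it returns only the maximum per city,
-- not the device that reached that maximum.
SELECT name, MAX(hits) AS pop
FROM city_hits
GROUP BY name;
```

Getting the matching device back therefore needs a second step, such as joining the maxima to the counts or ranking the rows within each city.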

WITH t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
  (SELECT t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, COUNT(*) as count FROM t2 GROUP BY t2.cityname, t2.os, t2.device, t2.browser),

t4 as
    (select cityname, MAX(count) as maximum from t3 group by cityname)

select t4.cityname, t4.maximum, t3.device, t3.os, t3.browser
from t4 join t3 on t4.cityname=t3.cityname and t4.maximum=t3.count;

This works, but I would like to know whether there is a way to optimize it...

Using analytic functions eliminates the unnecessary join:

WITH 
t1 as 
(select * from browserdata join citydata on cityid=id),

t2 as 
(select uap.device as device, uap.os as os, uap.browser as browser, name as cityname 
from t1 
lateral view ParseUserAgentUDTF(UserAgent) uap as device, os, browser),

t3 as
(select t2.cityname as cityname, t2.device as device, t2.browser as browser, t2.os as os, count(*) as count from t2 group by t2.cityname, t2.os, t2.device, t2.browser)

select cityname, maximum,  device, os, browser
 from
     (select cityname, device, browser, os, 
             max(count) over(partition by cityname)                         as maximum,
             dense_rank() over (partition by cityname order by count desc ) as rnk      
      from t3
     ) s  where rnk =1 
;
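Note that dense_rank() keeps every combination tied for first place, so a city can appear more than once in the result. If exactly one row per city is wanted, a possible variant is row_number() with a deterministic tie-breaker. The sketch below assumes the output of t3 is available as a hypothetical table city_combo_counts, with the count column renamed to cnt; both names are illustrative and do not come from the original query.

```sql
-- Hypothetical table holding the output of t3 (an assumption for illustration):
--   city_combo_counts(cityname STRING, device STRING, os STRING,
--                     browser STRING, cnt BIGINT)

SELECT cityname, cnt AS maximum, device, os, browser
FROM (
    SELECT cityname, device, os, browser, cnt,
           -- row_number() assigns 1 to a single row per city; the extra
           -- ORDER BY columns make the choice among ties repeatable.
           row_number() OVER (PARTITION BY cityname
                              ORDER BY cnt DESC, device, os, browser) AS rn
    FROM city_combo_counts
) s
WHERE rn = 1;
```

Breaking ties alphabetically is an arbitrary choice made for this sketch. If tied combinations should all be reported, the dense_rank() version above is the better fit.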