Get top N rows per day in Hive - rank()
I have this table, where each row denotes one sale:
sale_date salesman sale_item_id
20170102 JohnSmith 309
20170102 JohnSmith 292
20170103 AlexHam 93
I'm trying to get the top 20 salesmen per day, and I came up with this:
SELECT sale_date, salesman, sale_count, row_num
FROM (
    SELECT sale_date, salesman,
           count(*) as sale_count,
           rank() over (partition by sale_date order by sale_count desc) as row_num
    from salesforce.sales_data
) T
WHERE sale_date between '20170101' and '20170110'
  and row_num <= 20
But I get:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 5:35 Expression not in GROUP BY key 'sale_date'
I'm not sure when the grouping would take effect, though. Can anyone help?
You are missing the group by in the subquery:
SELECT sale_date, salesman, sale_count, row_num
FROM (SELECT sale_date, salesman,
             count(*) as sale_count,
             rank() over (partition by sale_date order by count(*) desc) as row_num
      FROM salesforce.sales_data
      GROUP BY sale_date, salesman
     ) T
WHERE sale_date between '20170101' and '20170110' and row_num <= 20;
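With the three sample rows from the question (two sales for JohnSmith on 20170102, one for AlexHam on 20170103), this should return something like:

sale_date   salesman    sale_count  row_num
20170102    JohnSmith   2           1
20170103    AlexHam     1           1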
I think Hive would also accept the column alias in the order by, i.e. order by sale_count desc.
Also note that you can get more (or fewer) than 20 rows when there are ties. If you need exactly 20 rows, you may want row_number() instead.
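For example, a minimal variation of the query above (only the window function changes) that keeps at most 20 rows per day, with ties at the cutoff broken arbitrarily:

SELECT sale_date, salesman, sale_count, row_num
FROM (SELECT sale_date, salesman,
             count(*) as sale_count,
             -- row_number() assigns a unique position, so at most 20 rows survive the filter
             row_number() over (partition by sale_date order by count(*) desc) as row_num
      FROM salesforce.sales_data
      GROUP BY sale_date, salesman
     ) T
WHERE sale_date between '20170101' and '20170110' and row_num <= 20;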
Try this:
SELECT sale_date, salesman, sale_count, row_num
FROM (
    SELECT sale_date, salesman, sale_count,
           rank() over (partition by sale_date order by sale_count desc) as row_num
    FROM (
        SELECT sale_date, salesman,
               count(*) over (partition by salesman) as sale_count
        FROM employee
    ) t1
) t2
WHERE sale_date between '20170101' and '20170110'
  and row_num <= 20;
Edited and tested. Your problem is essentially that you are trying to use the count before it has been computed for your over clause; if you compute the count in a subquery, partitioned by salesman, that solves it. You can't use a group by in that sales subquery, and if you did, you would lose access to sale_date.
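As a side note, the windowed count above is partitioned only by salesman, so it counts each salesman's sales across all days. If the intent is a per-day count (as in the question), a minimal sketch of the same windowed-count approach, partitioning by both columns and collapsing the duplicate per-sale rows before ranking, might look like this (assuming the original salesforce.sales_data table):

SELECT sale_date, salesman, sale_count, row_num
FROM (
    SELECT sale_date, salesman, sale_count,
           rank() over (partition by sale_date order by sale_count desc) as row_num
    FROM (
        -- the windowed count keeps sale_date selectable but produces one row per sale
        SELECT sale_date, salesman,
               count(*) over (partition by sale_date, salesman) as sale_count
        FROM salesforce.sales_data
    ) t1
    -- collapse the duplicate per-sale rows to one row per (sale_date, salesman)
    GROUP BY sale_date, salesman, sale_count
) t2
WHERE sale_date between '20170101' and '20170110'
  and row_num <= 20;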