Hive SELECT query: syntax error when returning the top 100 rows?
Here is my Hive query, taken directly from the TPC-DS toolkit:
WITH customer_total_return
     AS (SELECT sr_customer_sk AS ctr_customer_sk,
                sr_store_sk AS ctr_store_sk,
                Sum(sr_fee) AS ctr_total_return
         FROM store_returns,
              date_dim
         WHERE sr_returned_date_sk = d_date_sk
               AND d_year = 2000
         GROUP BY sr_customer_sk,
                  sr_store_sk)
SELECT TOP 100 c_customer_id
FROM customer_total_return ctr1,
     store,
     customer
WHERE ctr1.ctr_total_return > (SELECT Avg(ctr_total_return) * 1.2
                               FROM customer_total_return ctr2
                               WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
      AND s_store_sk = ctr1.ctr_store_sk
      AND s_state = 'TN'
      AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id;
However, I get the following error when I try to run it:
FAILED: ParseException line 11:11 cannot recognize input near 'TOP' '100' 'c_customer_id' in selection target
My understanding is that TOP 100 is not syntactically valid in HiveQL. How can I rewrite it correctly?
Use LIMIT instead of TOP, placed at the end of the query, as follows:
WITH customer_total_return
     AS (SELECT sr_customer_sk AS ctr_customer_sk,
                sr_store_sk AS ctr_store_sk,
                Sum(sr_fee) AS ctr_total_return
         FROM store_returns,
              date_dim
         WHERE sr_returned_date_sk = d_date_sk
               AND d_year = 2000
         GROUP BY sr_customer_sk,
                  sr_store_sk)
SELECT c_customer_id
FROM customer_total_return ctr1,
     store,
     customer
WHERE ctr1.ctr_total_return > (SELECT Avg(ctr_total_return) * 1.2
                               FROM customer_total_return ctr2
                               WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk)
      AND s_store_sk = ctr1.ctr_store_sk
      AND s_state = 'TN'
      AND ctr1.ctr_customer_sk = c_customer_sk
ORDER BY c_customer_id
LIMIT 100;
This is a bad example of a multi-level query. I would suggest:
WITH customer_total_return AS (
      SELECT sr.sr_customer_sk AS ctr_customer_sk,
             sr.sr_store_sk AS ctr_store_sk,
             SUM(sr.sr_fee) AS ctr_total_return,
             -- average of the per-customer totals within each store
             AVG(SUM(sr.sr_fee)) OVER (PARTITION BY sr.sr_store_sk) AS avg_store_sr_fee
      FROM store_returns sr JOIN
           date_dim d
           ON sr.sr_returned_date_sk = d.d_date_sk
      WHERE d.d_year = 2000
      GROUP BY sr.sr_customer_sk, sr.sr_store_sk
     )
SELECT c.c_customer_id
FROM customer_total_return ctr JOIN
     store s
     ON s.s_store_sk = ctr.ctr_store_sk JOIN
     customer c
     ON ctr.ctr_customer_sk = c.c_customer_sk
WHERE ctr.ctr_total_return > 1.2 * ctr.avg_store_sr_fee AND
      s.s_state = 'TN'
ORDER BY c.c_customer_id
LIMIT 100;
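Notes:
- Never use commas in the FROM clause. Always use proper, explicit, standard JOIN syntax.
- Qualify all column references, especially when a query references more than one table.
- The subquery that computes the average is not needed; a window function over the grouped rows does the same job.
- Hive uses LIMIT, not TOP.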
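The point about the average subquery can be checked on a small example. What follows is only a sketch, written in Python with SQLite standing in for Hive so that it runs anywhere; that substitution is an assumption, and window functions need SQLite 3.25 or later. The `ctr` table, its column names, and its rows are made up for illustration and stand for the already-aggregated output of the CTE. It compares the correlated-subquery filter with the windowed-average filter and shows LIMIT capping the result.

```python
import sqlite3

# Hypothetical per-customer, per-store totals (not TPC-DS data).
# Columns: customer_sk, store_sk, total_return
rows = [
    (1, 10, 100.0),
    (2, 10, 20.0),
    (3, 10, 30.0),
    (4, 20, 50.0),
    (5, 20, 55.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ctr (customer_sk INTEGER, store_sk INTEGER, total_return REAL)"
)
conn.executemany("INSERT INTO ctr VALUES (?, ?, ?)", rows)

# Style of the original query: a correlated subquery computes the store average.
correlated_sql = """
SELECT c1.customer_sk
FROM ctr c1
WHERE c1.total_return > (SELECT AVG(c2.total_return) * 1.2
                         FROM ctr c2
                         WHERE c2.store_sk = c1.store_sk)
ORDER BY c1.customer_sk
LIMIT 100
"""

# Style of the suggested rewrite: a window function computes the store average.
windowed_sql = """
SELECT t.customer_sk
FROM (SELECT customer_sk,
             total_return,
             AVG(total_return) OVER (PARTITION BY store_sk) AS avg_store_return
      FROM ctr) t
WHERE t.total_return > 1.2 * t.avg_store_return
ORDER BY t.customer_sk
LIMIT 100
"""


def customer_ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


correlated_ids = customer_ids(correlated_sql)
windowed_ids = customer_ids(windowed_sql)
print(correlated_ids, windowed_ids)
```

On this data both queries select the same customers, which is the sense in which the subquery is unnecessary. Whether the window form is also faster on Hive depends on the engine and the data, so treat that as something to measure rather than assume.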