HiveQL

Question

I want to retrieve the first and last row of a window in HiveQL.
I know of a couple of ways to do this:

Use FIRST_VALUE and LAST_VALUE on the columns I am interested in.

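SELECT customer,
       FIRST_VALUE(product) OVER w AS product_first,
       FIRST_VALUE(time) OVER w AS time_first,
       LAST_VALUE(product) OVER w AS product_last,
       LAST_VALUE(time) OVER w AS time_last
FROM table
-- Explicit frame: with ORDER BY, the default frame ends at the current row,
-- so LAST_VALUE would not reach the last row of the partition.
WINDOW w AS (PARTITION BY customer ORDER BY cost
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)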

Compute ROW_NUMBER() for each row and use a WHERE clause to keep row_number = 1.

WITH table_wRN AS
(
    SELECT *,
           row_number() OVER (PARTITION BY customer ORDER BY cost ASC) AS rn_B,
           row_number() OVER (PARTITION BY customer ORDER BY cost DESC) AS rn_E
    FROM table
),
table_first_last AS
(
    SELECT *
    FROM table_wRN
    WHERE (rn_E = 1 OR rn_B = 1)
)

SELECT table_first.customer,
       table_first.product AS product_first, table_first.time AS time_first,
       table_last.product AS product_last, table_last.time AS time_last
FROM table_first_last AS table_first
JOIN table_first_last AS table_last
  ON table_first.customer = table_last.customer
-- Filters go after the join; a WHERE clause cannot sit between FROM and JOIN.
WHERE table_first.rn_B = 1
  AND table_last.rn_E = 1

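My questions:

Does anyone know which of these two is more efficient?
- Intuitively, I think the first should be faster, because it needs no subquery or CTE.
- Experimentally, the second seems faster, but that may be because I am running first_value on multiple columns.
Is there a way to apply first_value once and retrieve multiple columns?
- I would like to reduce the number of times the windowing is performed/evaluated (something like caching the window).
- Pseudocode example:
  - FIRST_VALUE(product,time) OVER (W) AS product_first, time_first

Thanks!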

Answer 1

I am almost certain that the first would be more efficient. I mean, two window functions versus two window functions, filtering, and two joins?

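Once you multiply by the number of columns, there might be a question of which is faster. That said, look at the execution plan. I would expect all window functions that use the same window frame specification to share the same "window" processing, with only a small adjustment for each value.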
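One way to check this is to prefix the query with EXPLAIN and compare the plans of the two approaches. The sketch below reuses the question's placeholder table and columns; the backticks are an addition here, because TABLE is a reserved word in Hive and TIME is reserved in some versions. What the plan looks like depends on your Hive version and execution engine, so treat this as a starting point rather than a definitive procedure.

```sql
-- Print the execution plan instead of running the query.
-- `table`, product, `time`, customer and cost are the question's placeholders.
EXPLAIN
SELECT customer,
       FIRST_VALUE(product) OVER w AS product_first,
       FIRST_VALUE(`time`)  OVER w AS time_first,
       LAST_VALUE(product)  OVER w AS product_last,
       LAST_VALUE(`time`)   OVER w AS time_last
FROM `table`
WINDOW w AS (PARTITION BY customer ORDER BY cost
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
```

In the output, count how many windowing (PTF) operators and shuffle stages appear. If the four functions share a single windowing step, adding more columns should cost little; if each one gets its own step, the ROW_NUMBER() version may scale better.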

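Hive does not have very good support for complex data types such as structs and arrays. In databases that do, it is easy enough to provide a complex type.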
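To make the complex-type idea concrete, here is a minimal sketch that packs both columns into one struct so each window function is evaluated once per row. It assumes that FIRST_VALUE and LAST_VALUE accept a struct argument on your Hive version, which the answer does not confirm, so verify it on your cluster first. The names first_row, last_row, ts and t are hypothetical, and the backticks around `table` and `time` are added because those words can be reserved in Hive.

```sql
-- Assumption: FIRST_VALUE / LAST_VALUE work on struct values in your Hive version.
SELECT customer,
       first_row.product AS product_first,
       first_row.ts      AS time_first,
       last_row.product  AS product_last,
       last_row.ts       AS time_last
FROM (
    SELECT customer,
           -- One struct per row carries both columns through the window.
           FIRST_VALUE(named_struct('product', product, 'ts', `time`)) OVER w AS first_row,
           LAST_VALUE(named_struct('product', product, 'ts', `time`))  OVER w AS last_row
    FROM `table`
    WINDOW w AS (PARTITION BY customer ORDER BY cost
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) t;
```

Like the first query in the question, this returns one row per input row rather than one row per customer. Whether it is actually faster than four separate window functions is something only the execution plan or a timing run can tell you.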

HiveQL - first_value of multiple columns over window

sql

hadoop