Get the latest column value from a Hive table conditionally on other columns

I have a Hive table 'Orders' with four columns (id string, name string, order string, ts string). Sample data from the table is shown below.

id  name  order      ts
--  ----  ---------  -------------------
1   abc   completed  2018-04-12 08:15:26
2   def   received   2018-04-15 06:20:17
3   ghi   processed  2018-04-16 11:36:56
4   jkl   received   2018-04-05 12:23:34
3   ghi   received   2018-03-23 16:43:46
1   abc   processed  2018-03-17 18:39:22
1   abc   received   2018-02-25 20:07:56

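The order column has three states: received -> processed -> completed. A single name can have many orders, and each order goes through these three stages. I need the latest order value for a given 'id' and 'name'. This may look like a beginner's question, but I am stuck on it.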

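I tried the queries below, but they do not work, and I cannot use the max function directly on the 'ts' column because it is stored as a string. Please suggest the best approach. Thanks in advance.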

Queries I tried:

SELECT
ORDER
FROM Orders
WHERE id = '1'
    AND name = 'ghi'
    AND ts = (
        SELECT max(unix_timestamp(ts, 'yyyy-MM-dd HH:mm:SS'))
        FROM Orders
        )

Error while compiling statement: FAILED: ParseException line 2:0 cannot recognize input near 'select' 'max' '(' in expression specification

SELECT
ORDER
FROM Orders
WHERE id = '1'
    AND name = 'ghi'
    AND max(unix_timestamp(ts, 'yyyy-MM-dd HH:mm:SS'))

Error while compiling statement: FAILED: SemanticException [Error 10128]: Line 1:93 Not yet supported place for UDAF 'max'

select o.order
from Orders o
inner join (
    select id, name, order, max(ts) as ts
    from Orders
    group by id, name, order
) ord on o.id = ord.id and o.name = ord.name and o.ts = ord.ts
where o.id = '1' and o.name = 'abc'

This query executed, but instead of a single latest order stage it returned every order stage together with its own latest timestamp.
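The extra rows most likely come from grouping by order inside the subquery, which produces one maximum timestamp per stage instead of one per (id, name). A minimal sketch of the same join with order removed from the GROUP BY follows. It assumes ts always uses the yyyy-MM-dd HH:mm:ss layout, so that max() on the string agrees with chronological order, and it backticks `order` on the assumption that the column name collides with Hive's reserved keyword.

```sql
SELECT o.`order`
FROM Orders o
INNER JOIN (
    -- One row per (id, name): the greatest ts string, which is also the latest time
    -- as long as every value follows the yyyy-MM-dd HH:mm:ss layout.
    SELECT id, name, max(ts) AS ts
    FROM Orders
    GROUP BY id, name
) ord
  ON o.id = ord.id AND o.name = ord.name AND o.ts = ord.ts
WHERE o.id = '1' AND o.name = 'abc';
```

Against the sample data above this should return the single row completed.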

Please help.

You can use the RANK analytic function to solve your problem, as shown below:

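-- order is a reserved word in Hive, so the column name is quoted with backticks
select id, name, `order`, ts
from (
    select id, name, `order`, ts,
           -- sort newest first so that rank 1 is the latest row
           rank() over (partition by id, name order by ts desc) r
    from orders
) k
where r = 1
and id = '1'
and name = 'ghi'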

If you want the latest record for every id and name, simply leave out the filters on id and name and you will get the result you need.
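As a sketch of that variant, the query below drops the id and name filters and keeps one row per (id, name) pair. It uses row_number() instead of rank(), a substitution made here so that ties on ts still yield exactly one row per pair. The alias names latest and rn are illustrative, and the backticks around `order` assume the column name collides with Hive's reserved keyword.

```sql
-- Latest row for every (id, name) pair in Orders.
SELECT id, name, `order`, ts
FROM (
    SELECT id, name, `order`, ts,
           -- Number rows within each (id, name) group, newest ts first.
           row_number() OVER (PARTITION BY id, name ORDER BY ts DESC) AS rn
    FROM Orders
) latest
WHERE rn = 1;
```

With the sample data from the question, this should give completed for id 1, received for id 2, processed for id 3, and received for id 4.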

All the best!

For a given order you want a single row, so you can use ORDER BY with LIMIT:

SELECT o.*
FROM Orders o
WHERE id = '1' AND  -- id is declared as a string in the question
      name = 'ghi'
ORDER BY ts DESC
LIMIT 1;

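This should also be the best-performing option, since only the rows matching the filter have to be sorted. Sorting on ts works even though it is a string, because values in the yyyy-MM-dd HH:mm:ss layout sort lexicographically in chronological order.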