How to use Hive SQL's lag function with two columns?

I think I have a fairly simple SQL problem to solve; I just don't know how to search for it properly.

Say I have a table whose values are updated over time:

|timestamp|value|session|
|---------|-----|-------|
| ts1     | v1  |  s1   |
| ts2     | v2  |  s1   |
| ts3     | v3  |  s1   |
| ...     | ..  |  s2   |

I want to get the current value and the previous value, along with their associated timestamps.

So the result should be:

|timestamp_current|value_current|timestamp_prev|value_prev|
|-----------------|-------------|--------------|----------|
|      ts2        |      v2     |    ts1       |    v1    |
|      ts3        |      v3     |    ts2       |    v2    |
|      ...        |      ..     |    ...       |    ..    |

If I only wanted the previous value and not the previous timestamp, I think the following query would be correct:

select timestamp, value, lag(value,1) over (partition by (session) order by timestamp) from mytable

But what is the correct way to add two values from the previous row? Do I add two lag clauses, or is there a better way?

You can use lag() twice to derive your result: once for prev_timestamp and once for prev_value, as shown below.

select * from
(
select timestamp, 
       value, 
       lag(timestamp) over(partition by session order by timestamp) as prev_timestamp, 
       lag(value) over(partition by session order by timestamp) as prev_value
from table1
) t
where prev_timestamp is not null

The where clause excludes the rows whose prev_timestamp is NULL, i.e. the first row of each session, which has no previous row.
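On the "is there a better way" part of the question: one option is to define the window once and reuse it for both lag() calls. The following is only a sketch, assuming a Hive version that supports the WINDOW clause and reusing the table1, session, timestamp, and value names from the query above. The backticks around timestamp are an assumption too: newer Hive versions may treat it as a reserved word, and you can drop them if your setup accepts the bare name.

```sql
-- Sketch: same logic as the query above, with the window spec named once.
-- Assumes table1(session, timestamp, value) as in the answer.
select *
from (
  select `timestamp`,
         value,
         lag(`timestamp`) over w as prev_timestamp,  -- previous row's timestamp
         lag(value)       over w as prev_value       -- previous row's value
  from table1
  window w as (partition by session order by `timestamp`)
) t
where prev_timestamp is not null;  -- drop the first row of each session
```

This does not change what is computed; it only keeps the partition and ordering in one place so the two lag() calls cannot drift apart.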

Result:

+-----------+-------+----------------+------------+
| timestamp | value | prev_timestamp | prev_value |
+-----------+-------+----------------+------------+
| ts2       | v2    | ts1            | v1         |
| ts3       | v3    | ts2            | v2         |
+-----------+-------+----------------+------------+
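If you want to check this without creating a table, a rough way is to inline the sample rows in a CTE. This is a sketch under a few assumptions: the literal values 'ts1', 'v1', and so on are placeholders standing in for real timestamps and values, and your Hive version accepts SELECT without FROM and UNION ALL inside a CTE.

```sql
-- Sketch: inline sample data standing in for table1 (placeholder string values).
with table1 as (
  select 's1' as session, 'ts1' as `timestamp`, 'v1' as value
  union all select 's1', 'ts2', 'v2'
  union all select 's1', 'ts3', 'v3'
)
select *
from (
  select `timestamp`,
         value,
         lag(`timestamp`) over (partition by session order by `timestamp`) as prev_timestamp,
         lag(value)       over (partition by session order by `timestamp`) as prev_value
  from table1
) t
where prev_timestamp is not null   -- the ts1 row has no predecessor and is dropped
order by `timestamp`;
```

With these three rows, the ts2 and ts3 rows remain, each paired with the row before it, matching the table above.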


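Another way to achieve this is to use the row_number function and join the records, as shown below. However, using two lag calls in the same query will most likely perform better than the row_number and left join approach.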

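WITH dt AS (
  SELECT session, timestamp, value,
         ROW_NUMBER() OVER(PARTITION BY session ORDER BY timestamp) AS row_num
  FROM table1
)
SELECT 
  t0.timestamp, 
  t0.value, 
  t1.timestamp as prev_timestamp,
  t1.value as prev_value
FROM dt t0 
LEFT OUTER JOIN dt t1
  ON t0.session = t1.session       -- pair rows only within the same session
 AND t0.row_num = t1.row_num + 1   -- t1 is the row immediately before t0
-- The left join keeps the first row of each session with NULL prev_* columns.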