First and last value of window function in one row in PostgreSQL

Question

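I would like to get, in a single row, the first value of one column and the last value of a second column for a specified partition. To do that I wrote this query: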

SELECT DISTINCT
b.machine_id,
batch,
timestamp_sta,
timestamp_stp,
FIRST_VALUE(timestamp_sta) OVER w AS batch_start,
LAST_VALUE(timestamp_stp) OVER w AS batch_end
FROM db_data.sta_stp AS a
JOIN db_data.ll_lu AS b
ON a.ll_lu_id=b.id
WINDOW w AS (PARTITION BY batch, machine_id ORDER BY timestamp_sta)
ORDER BY timestamp_sta, batch, machine_id;

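But as the screenshot shows, the data returned in the batch_end column is incorrect.

The batch_start column holds the correct first value of timestamp_sta. But batch_end should be "2012-09-17 10:49:45"; instead it equals the timestamp_stp of the same row.

Why is that?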

Answer 1

From the syntax documentation:

The frame_clause specifies the set of rows constituting the window frame, which is a subset of the current partition, for those window functions that act on the frame instead of the whole partition. The frame can be specified in either RANGE or ROWS mode; in either case, it runs from the frame_start to the frame_end. If frame_end is omitted, it defaults to CURRENT ROW.

A frame_start of UNBOUNDED PRECEDING means that the frame starts with the first row of the partition, and similarly a frame_end of UNBOUNDED FOLLOWING means that the frame ends with the last row of the partition.

And from the function list:

last_value(value any) returns value evaluated at the row that is the last row of the window frame

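Because the window has an ORDER BY and no explicit frame, the frame ends at the current row, so LAST_VALUE() returns the current row's own value. The correct SQL should therefore be: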

SELECT DISTINCT
b.machine_id,
batch,
timestamp_sta,
timestamp_stp,
FIRST_VALUE(timestamp_sta) OVER w AS batch_start,
LAST_VALUE(timestamp_stp) OVER w AS batch_end
FROM db_data.sta_stp AS a
JOIN db_data.ll_lu AS b
ON a.ll_lu_id=b.id
WINDOW w AS (PARTITION BY batch, machine_id ORDER BY timestamp_sta
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY timestamp_sta, batch, machine_id;
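
To see the effect of the frame clause in isolation, here is a minimal, self-contained sketch. The inline rows in the CTE t are hypothetical sample data (only the final stop time "2012-09-17 10:49:45" is taken from the question), and the window names w_default and w_full are made up for illustration.

```sql
-- Hypothetical sample data: one batch on one machine, three start/stop intervals.
WITH t(batch, machine_id, timestamp_sta, timestamp_stp) AS (
   VALUES (1, 1, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
        , (1, 1, timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
        , (1, 1, timestamp '2012-09-17 10:35:00', timestamp '2012-09-17 10:49:45')
)
SELECT timestamp_sta
     , timestamp_stp
       -- Default frame: ends at the current row, so this repeats the row's own timestamp_stp.
     , last_value(timestamp_stp) OVER w_default AS end_default_frame
       -- Explicit frame: spans the whole partition, so this is the stop time of the last row.
     , last_value(timestamp_stp) OVER w_full    AS end_full_frame
FROM   t
WINDOW w_default AS (PARTITION BY batch, machine_id ORDER BY timestamp_sta)
     , w_full    AS (PARTITION BY batch, machine_id ORDER BY timestamp_sta
                     RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY timestamp_sta;
```

With this data, end_default_frame should match timestamp_stp on every row, while end_full_frame should be the same value on all three rows.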

Answer 2

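The explanation by @Łukasz Kamiński gets to the core of the problem.

However, last_value should be replaced with max(). You are sorting by timestamp_sta, so the last value is the one from the row with the greatest timestamp_sta, which may or may not be the row with the greatest timestamp_stp. I would also sort by both fields.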

SELECT DISTINCT
  b.machine_id,
  batch,
  timestamp_sta,
  timestamp_stp,
  FIRST_VALUE(timestamp_sta) OVER w AS batch_start,
  MAX(timestamp_stp) OVER w AS batch_end
FROM db_data.sta_stp AS a
JOIN db_data.ll_lu AS b
ON a.ll_lu_id=b.id
WINDOW w AS (PARTITION BY batch, machine_id 
             ORDER BY timestamp_sta,timestamp_stp 
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY timestamp_sta, batch, machine_id;

http://rextester.com/UTDE60342
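
The difference between last_value() and max() only shows up when intervals overlap. The following sketch uses two hypothetical rows where the interval that starts first also stops last; the data and column aliases are assumptions made for illustration.

```sql
-- Hypothetical sample data: the first interval starts earlier but stops later.
WITH t(batch, machine_id, timestamp_sta, timestamp_stp) AS (
   VALUES (1, 1, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 11:00:00')
        , (1, 1, timestamp '2012-09-17 10:05:00', timestamp '2012-09-17 10:20:00')
)
SELECT timestamp_sta
     , timestamp_stp
     , last_value(timestamp_stp) OVER w AS end_by_last_value  -- 10:20:00, stop time of the latest start
     , max(timestamp_stp)        OVER w AS end_by_max         -- 11:00:00, latest stop time overall
FROM   t
WINDOW w AS (PARTITION BY batch, machine_id ORDER BY timestamp_sta
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY timestamp_sta;
```

If the intervals within a batch never overlap, both expressions return the same value, and the choice is mostly a matter of stating the intent clearly.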

Answer 3

The question is old, but this solution is simpler and faster than what has been posted so far:

SELECT b.machine_id
     , batch
     , timestamp_sta
     , timestamp_stp
     , min(timestamp_sta) OVER w AS batch_start
     , max(timestamp_stp) OVER w AS batch_end
FROM   db_data.sta_stp a
JOIN   db_data.ll_lu   b ON a.ll_lu_id = b.id
WINDOW w AS (PARTITION BY batch, b.machine_id) -- No ORDER BY !
ORDER  BY timestamp_sta, batch, machine_id; -- why this ORDER BY?

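If you add ORDER BY to the window definition, the default frame ends at the current row (including its peers), so each subsequent row with a greater ORDER BY expression has a later frame end. Neither max() nor last_value() can then return the "last" timestamp for the whole partition. Without ORDER BY, all rows of the same partition are peers and you get your desired result.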
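
As a small illustration of how ORDER BY inside the window changes an aggregate, the sketch below computes max(timestamp_stp) twice over the same partition. The rows are hypothetical sample data, and the aliases running_end and batch_end_all are made up for this example.

```sql
-- Hypothetical sample data: one batch on one machine, three start/stop intervals.
WITH t(batch, machine_id, timestamp_sta, timestamp_stp) AS (
   VALUES (1, 1, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
        , (1, 1, timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
        , (1, 1, timestamp '2012-09-17 10:35:00', timestamp '2012-09-17 10:49:45')
)
SELECT timestamp_sta
     , timestamp_stp
       -- With ORDER BY and no explicit frame: computed over rows up to the current row and its peers.
     , max(timestamp_stp) OVER (PARTITION BY batch, machine_id
                                ORDER BY timestamp_sta)  AS running_end
       -- Without ORDER BY: every row of the partition is a peer, so the whole partition is used.
     , max(timestamp_stp) OVER (PARTITION BY batch, machine_id) AS batch_end_all
FROM   t
ORDER  BY timestamp_sta;
```

Here running_end grows row by row, while batch_end_all is identical on every row of the batch.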

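The ORDER BY you added works (the outer one, not the one in the window definition), but it does not seem to make sense and makes the query more expensive. You should probably use an ORDER BY clause that agrees with your window definition to avoid additional sort cost: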

... 
ORDER BY batch, b.machine_id, timestamp_sta, timestamp_stp;

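I don't see the need for DISTINCT in this query. You could just add it if you actually need it. Or use DISTINCT ON (). But then the ORDER BY clause becomes even more relevant. See: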

Select first row in each GROUP BY group?
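
If one row per batch is what you are after, a DISTINCT ON variant could look roughly like the sketch below. The sample rows are hypothetical, and keeping the earliest-starting row per group is an assumption about the desired output, not something stated in the question.

```sql
-- Hypothetical sample data: two batches on one machine.
WITH t(batch, machine_id, timestamp_sta, timestamp_stp) AS (
   VALUES (1, 1, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
        , (1, 1, timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
        , (2, 1, timestamp '2012-09-17 11:00:00', timestamp '2012-09-17 11:20:00')
        , (2, 1, timestamp '2012-09-17 11:25:00', timestamp '2012-09-17 11:40:00')
)
SELECT DISTINCT ON (batch, machine_id)
       batch
     , machine_id
     , timestamp_sta                            -- taken from the row that sorts first in its group
     , timestamp_stp
     , min(timestamp_sta) OVER w AS batch_start -- window functions run before DISTINCT ON
     , max(timestamp_stp) OVER w AS batch_end
FROM   t
WINDOW w AS (PARTITION BY batch, machine_id)
ORDER  BY batch, machine_id, timestamp_sta;     -- leading expressions must match DISTINCT ON
```

The trailing timestamp_sta in ORDER BY decides which row of each group survives, which is why the outer ORDER BY matters more once DISTINCT ON is involved.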

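If you need some other column(s) from the same row (while still sorting by timestamps), your idea of using FIRST_VALUE() and LAST_VALUE() may be the way to go. You would then probably need to append this to the window definition: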

ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

See:

PostgreSQL query with max and min date plus associated id per row
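
A minimal sketch of that variant, fetching a different column from the first and last row of each partition. The id column and its values are hypothetical and exist only to have something other than the timestamps to return.

```sql
-- Hypothetical sample data; the id column is an assumption for illustration.
WITH t(id, batch, machine_id, timestamp_sta, timestamp_stp) AS (
   VALUES (101, 1, 1, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
        , (102, 1, 1, timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
        , (103, 1, 1, timestamp '2012-09-17 10:35:00', timestamp '2012-09-17 10:49:45')
)
SELECT id
     , timestamp_sta
     , timestamp_stp
     , first_value(id) OVER w AS first_id  -- id of the row with the earliest start
     , last_value(id)  OVER w AS last_id   -- id of the row with the latest start
FROM   t
WINDOW w AS (PARTITION BY batch, machine_id
             ORDER BY timestamp_sta, timestamp_stp
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY timestamp_sta, timestamp_stp;
```

Without the ROWS clause, last_id would simply repeat each row's own id, for the same reason the original batch_end was wrong.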
