First and last value of window function in one row in PostgreSQL
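I want the first value of one column and the last value of a second column, in a single row, for a given partition. For that I wrote this query: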
SELECT DISTINCT
b.machine_id,
batch,
timestamp_sta,
timestamp_stp,
FIRST_VALUE(timestamp_sta) OVER w AS batch_start,
LAST_VALUE(timestamp_stp) OVER w AS batch_end
FROM db_data.sta_stp AS a
JOIN db_data.ll_lu AS b
ON a.ll_lu_id=b.id
WINDOW w AS (PARTITION BY batch, machine_id ORDER BY timestamp_sta)
ORDER BY timestamp_sta, batch, machine_id;
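But the data returned in the batch_end column is incorrect, as shown in the image.
The batch_start column has the correct first value of the timestamp_sta column. But batch_end should be "2012-09-17 10:49:45", and instead it equals timestamp_stp from the same row.
Why is that?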
The frame_clause specifies the set of rows constituting the window frame, which is a subset of the current partition, for those window functions that act on the frame instead of the whole partition. The frame can be specified in either RANGE or ROWS mode; in either case, it runs from the frame_start to the frame_end. If frame_end is omitted, it defaults to CURRENT ROW.
A frame_start of UNBOUNDED PRECEDING means that the frame starts with the first row of the partition, and similarly a frame_end of UNBOUNDED FOLLOWING means that the frame ends with the last row of the partition.
last_value(value any) returns value evaluated at the row that is the last row of the window frame
So the correct SQL should be:
SELECT DISTINCT
b.machine_id,
batch,
timestamp_sta,
timestamp_stp,
FIRST_VALUE(timestamp_sta) OVER w AS batch_start,
LAST_VALUE(timestamp_stp) OVER w AS batch_end
FROM db_data.sta_stp AS a
JOIN db_data.ll_lu AS b
ON a.ll_lu_id=b.id
WINDOW w AS (PARTITION BY batch, machine_id ORDER BY timestamp_sta range between unbounded preceding and unbounded following)
ORDER BY timestamp_sta, batch, machine_id;
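The effect of the default frame can be reproduced without the original tables. The following is only a minimal sketch: the rows in the VALUES list are made up, and the names t, sta and stp are hypothetical stand-ins for the joined tables and the two timestamp columns. With ORDER BY and no frame clause, the frame ends at the current row, so last_value() simply returns the current row's own value.

```sql
-- Hypothetical sample data: one batch with three start/stop intervals.
SELECT sta
     , stp
       -- default frame: ends at the current row
     , last_value(stp) OVER (ORDER BY sta) AS default_frame
       -- explicit frame: spans the whole partition
     , last_value(stp) OVER (ORDER BY sta
                             RANGE BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING) AS full_frame
FROM  (VALUES
         (timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
       , (timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
       , (timestamp '2012-09-17 10:35:00', timestamp '2012-09-17 10:49:45')
      ) AS t(sta, stp)
ORDER  BY sta;
```

In this sketch default_frame repeats stp from the same row, while full_frame is 10:49:45 on every row.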
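@Łukasz Kamiński's explanation addresses the core of the problem.
However, last_value should be replaced with max(). You are ordering by timestamp_sta, so the last value is the one from the row with the greatest timestamp_sta, which may or may not be the greatest timestamp_stp. I would also order by both fields.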
SELECT DISTINCT
b.machine_id,
batch,
timestamp_sta,
timestamp_stp,
FIRST_VALUE(timestamp_sta) OVER w AS batch_start,
MAX(timestamp_stp) OVER w AS batch_end
FROM db_data.sta_stp AS a
JOIN db_data.ll_lu AS b
ON a.ll_lu_id=b.id
WINDOW w AS (PARTITION BY batch, machine_id
ORDER BY timestamp_sta,timestamp_stp
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY timestamp_sta, batch, machine_id;
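A small sketch of why the two can differ, assuming intervals may nest. The rows and the names t, sta and stp are hypothetical. The interval that starts last is not the one that ends last, so last_value() ordered by sta picks the wrong stop time even with a full frame.

```sql
-- Hypothetical sample data: the second interval starts later but ends earlier.
SELECT sta
     , stp
     , last_value(stp) OVER w AS last_by_sta   -- stp of the row with the greatest sta
     , max(stp)        OVER w AS latest_stp    -- greatest stp in the partition
FROM  (VALUES
         (timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:50:00')
       , (timestamp '2012-09-17 10:10:00', timestamp '2012-09-17 10:20:00')
      ) AS t(sta, stp)
WINDOW w AS (ORDER BY sta
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY sta;
```

Here last_by_sta is 10:20:00 on both rows, while latest_stp is 10:50:00.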
The question is old, but this solution is simpler and faster than what has been posted so far:
SELECT b.machine_id
, batch
, timestamp_sta
, timestamp_stp
, min(timestamp_sta) OVER w AS batch_start
, max(timestamp_stp) OVER w AS batch_end
FROM db_data.sta_stp a
JOIN db_data.ll_lu b ON a.ll_lu_id = b.id
WINDOW w AS (PARTITION BY batch, b.machine_id) -- No ORDER BY !
ORDER BY timestamp_sta, batch, machine_id; -- why this ORDER BY?
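If you add ORDER BY to the window definition, the default frame runs from the start of the partition to the current row (including its peers), so each next row with a greater ORDER BY expression has a later frame end. Neither max() nor last_value() can return the "last" timestamp for the whole partition then. Without ORDER BY, all rows of the same partition are peers and you get your desired result.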
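The difference can be sketched with hypothetical sample rows (t, batch, sta and stp are made-up names). With ORDER BY in the window definition, the default frame stops at the current row and its peers, so max() turns into a running maximum, while min() over the start column is unaffected because the frame still begins at the first row. Without ORDER BY the frame is the whole partition.

```sql
-- Hypothetical sample data: a single batch with three intervals.
SELECT sta
     , stp
     , min(sta) OVER (PARTITION BY batch ORDER BY sta) AS running_min    -- always the first sta
     , max(stp) OVER (PARTITION BY batch ORDER BY sta) AS running_max    -- grows row by row
     , max(stp) OVER (PARTITION BY batch)              AS partition_max  -- same on every row
FROM  (VALUES
         (1, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
       , (1, timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
       , (1, timestamp '2012-09-17 10:35:00', timestamp '2012-09-17 10:49:45')
      ) AS t(batch, sta, stp)
ORDER  BY batch, sta;
```

In this sketch running_max equals stp of the current row, and only partition_max shows 10:49:45 on every row.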
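The ORDER BY you added works (the outer one, not the one in the window definition), but it does not seem to make sense and makes the query more expensive. You should probably use an ORDER BY clause that agrees with your window definition to avoid additional sort cost: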
...
ORDER BY batch, b.machine_id, timestamp_sta, timestamp_stp;
I don't see the need for DISTINCT in this query. You could just add it if you actually need it. Or DISTINCT ON (). But then the ORDER BY clause becomes even more relevant. See:
- Select first row in each GROUP BY group?
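If one row per batch and machine is what is actually wanted, a DISTINCT ON variant might look like the sketch below. The sample rows and the names t, sta and stp are hypothetical. ORDER BY has to start with the DISTINCT ON expressions; any further ORDER BY expressions would decide which row of each group is kept, which does not matter here because every row of a group carries the same batch_start and batch_end.

```sql
-- Hypothetical sample data: two batches on the same machine.
SELECT DISTINCT ON (batch, machine_id)
       batch
     , machine_id
     , min(sta) OVER w AS batch_start
     , max(stp) OVER w AS batch_end
FROM  (VALUES
         (1, 7, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
       , (1, 7, timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
       , (2, 7, timestamp '2012-09-17 11:00:00', timestamp '2012-09-17 11:20:00')
      ) AS t(batch, machine_id, sta, stp)
WINDOW w AS (PARTITION BY batch, machine_id)   -- no ORDER BY: frame is the whole partition
ORDER  BY batch, machine_id;
```

This returns one row per batch: 10:00:00 to 10:30:00 for batch 1 and 11:00:00 to 11:20:00 for batch 2.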
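If you need some other column(s) from the same row (while still sorting by timestamps), your idea with FIRST_VALUE() and LAST_VALUE() might be the way to go. You would probably need to append this to the window definition then:
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING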
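As an illustration, the sketch below pulls a different column from the first and last row of the partition. The id column, the sample rows and the names t, sta and stp are all hypothetical and not taken from the original tables.

```sql
-- Hypothetical sample data: id identifies each start/stop row.
SELECT id
     , sta
     , stp
     , first_value(id) OVER w AS first_id   -- id of the earliest row
     , last_value(id)  OVER w AS last_id    -- id of the latest row
FROM  (VALUES
         (101, timestamp '2012-09-17 10:00:00', timestamp '2012-09-17 10:10:00')
       , (102, timestamp '2012-09-17 10:15:00', timestamp '2012-09-17 10:30:00')
       , (103, timestamp '2012-09-17 10:35:00', timestamp '2012-09-17 10:49:45')
      ) AS t(id, sta, stp)
WINDOW w AS (ORDER BY sta, stp
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER  BY sta, stp;
```

With these rows first_id is 101 and last_id is 103 on every row.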