Snowflake - Grouping with a "key change"
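I have a problem that I could not solve on my own or through research, although it should be possible.
Imagine a Snowflake table with the following data (the exact column types don't matter):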
ColumnA | ColumnB | Timestamp |
---|---|---|
SYSTEM | I | 02:01 |
SYSTEM | I | 02:02 |
SYSTEM | U | 02:03 |
SYSTEM | U | 02:04 |
SYSTEM | I | 02:05 |
SYSTEM | U | 02:06 |
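I want to aggregate the data so that my result set contains 4 groups, each with its minimum and maximum timestamp:
- Group 1: the first two records, where ColumnB = I (min = 02:01, max = 02:02)
- Group 2: the next two records, where ColumnB = U (min = 02:03, max = 02:04)
- Group 3: the 5th record (min and max both = 02:05)
- Group 4: the 6th record (min and max both = 02:06)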
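Note that ColumnA can hold other values; if it does, those rows should form their own groups following the same principle.
Does anyone know how to do this with a SELECT statement? A plain GROUP BY obviously does not work, because it cannot separate groups 1 and 3 (or groups 2 and 4).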
Yes, with an additional subgroup column:
WITH cte AS (
    -- Fetch the previous row's ColumnB within each ColumnA partition
    SELECT *, LAG(ColumnB) OVER(PARTITION BY ColumnA ORDER BY timestamp) AS prevColumnB
    FROM tab
), cte2 AS (
    -- Running count of changes: goes up by 1 whenever ColumnB differs from the previous row
    SELECT *,
           SUM(CASE WHEN ColumnB = prevColumnB OR prevColumnB IS NULL THEN 0 ELSE 1 END)
               OVER(PARTITION BY ColumnA ORDER BY timestamp) AS subgrp
    FROM cte
)
SELECT ColumnA, ColumnB, subgrp, MIN(timestamp) AS min_t, MAX(timestamp) AS max_t
FROM cte2
GROUP BY ColumnA, ColumnB, subgrp
ORDER BY ColumnA, subgrp;
How it works:
+----------+----------+-----------+-------------+--------+
| ColumnA | ColumnB | timestamp | prevColumnB | subgrp |
+----------+----------+-----------+-------------+--------+
| SYSTEM | I | 02:01 | NULL | 0 |
| SYSTEM | I | 02:02 | I | 0 |
| SYSTEM | U | 02:03 | I | 1 |
| SYSTEM | U | 02:04 | U | 1 |
| SYSTEM | I | 02:05 | U | 2 |
| SYSTEM | U | 02:06 | I | 3 |
+----------+----------+-----------+-------------+--------+
Once the subgrp column is in place, a standard GROUP BY does the rest.
Addendum:
The MATCH_RECOGNIZE clause achieves a similar result without any CTEs.
SELECT *
FROM t
MATCH_RECOGNIZE (
    PARTITION BY columnA
    ORDER BY timestamp
    MEASURES MATCH_NUMBER() AS grp_id
            --,CLASSIFIER() AS cls
            ,FIRST_VALUE(columnB) AS columnB
            ,FIRST_VALUE(timestamp) AS min_t
            ,LAST_VALUE(timestamp) AS max_t
    PATTERN (b* a)
    DEFINE a AS columnB != LEAD(columnB) OR LEAD(columnB) IS NULL
          ,b AS columnB = LEAD(columnB)
) mr
ORDER BY columnA, grp_id;
Result:
COLUMNA GRP_ID COLUMNB MIN_T MAX_T
SYSTEM 1 I 02:01 02:02
SYSTEM 2 U 02:03 02:04
SYSTEM 3 I 02:05 02:05
SYSTEM 4 U 02:06 02:06
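To see how the pattern carves the rows into islands, it can help to return every row together with the symbol it matched. The following is a minimal sketch, not part of the original answer, assuming the same table t and column names as above. It swaps the default one-row-per-match output for ALL ROWS PER MATCH and keeps only the MATCH_NUMBER() and CLASSIFIER() measures:

```sql
-- Sketch only: same PATTERN and DEFINE as above, but every input row is returned
-- so the symbol assigned to each row can be inspected.
SELECT columnA, columnB, timestamp, grp_id, cls
FROM t
MATCH_RECOGNIZE (
    PARTITION BY columnA
    ORDER BY timestamp
    MEASURES MATCH_NUMBER() AS grp_id   -- sequential number of the match (island)
            ,CLASSIFIER()   AS cls      -- pattern symbol the row was matched to
    ALL ROWS PER MATCH
    PATTERN (b* a)
    DEFINE a AS columnB != LEAD(columnB) OR LEAD(columnB) IS NULL  -- last row of an island
          ,b AS columnB = LEAD(columnB)                            -- row followed by the same value
) mr
ORDER BY columnA, timestamp;
```

With the sample data, the rows at 02:01 and 02:03 should be classified as b and the other four rows as a, so each match ends exactly where columnB changes or the partition runs out.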
This is a type of gaps-and-islands problem. In this case, I think the difference of row numbers is the simplest way to solve it:
select columnA, columnB, min(timestamp), max(timestamp)
from (select t.*,
             row_number() over (partition by columnA order by timestamp) as seqnum,
             row_number() over (partition by columnA, columnB order by timestamp) as seqnum_2
      from t
     ) t
group by columnA, columnB, (seqnum - seqnum_2);
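Why this works is a bit tricky to explain. But if you look at the results of the subquery, you will see that the difference between the two row numbers stays constant across adjacent rows with the same columnB.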
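One way to look at the subquery results is to run it against the sample rows from the question. The sketch below is self-contained under a few assumptions made for illustration only: the six rows are built inline with VALUES, the timestamps are plain strings, and the difference is exposed as an extra column named seq_diff, which the original query does not have.

```sql
-- Inline copy of the question's sample data; timestamps are strings for illustration.
WITH t AS (
    SELECT *
    FROM (VALUES
        ('SYSTEM', 'I', '02:01'),
        ('SYSTEM', 'I', '02:02'),
        ('SYSTEM', 'U', '02:03'),
        ('SYSTEM', 'U', '02:04'),
        ('SYSTEM', 'I', '02:05'),
        ('SYSTEM', 'U', '02:06')
    ) AS v (columnA, columnB, timestamp)
)
SELECT t.*,
       row_number() over (partition by columnA order by timestamp)          as seqnum,
       row_number() over (partition by columnA, columnB order by timestamp) as seqnum_2,
       row_number() over (partition by columnA order by timestamp)
         - row_number() over (partition by columnA, columnB order by timestamp) as seq_diff
FROM t
ORDER BY columnA, timestamp;

-- COLUMNA  COLUMNB  TIMESTAMP  SEQNUM  SEQNUM_2  SEQ_DIFF
-- SYSTEM   I        02:01      1       1         0
-- SYSTEM   I        02:02      2       2         0
-- SYSTEM   U        02:03      3       1         2
-- SYSTEM   U        02:04      4       2         2
-- SYSTEM   I        02:05      5       3         2
-- SYSTEM   U        02:06      6       3         3
```

Within one island both counters advance by one per row, so seq_diff stays fixed. When columnB switches, seqnum keeps counting while seqnum_2 resumes from that value's own count, so seq_diff changes. The rows from 02:03 to 02:05 all share seq_diff = 2, which is why columnB also has to be part of the GROUP BY.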