How to get the first and last value for each partition in a column using SQL
My dataset looks like this.
ts c1 c2 c3
2019-01-04T01:50:00.000Z C 25.48801612854004 33.317527770996094
2019-01-04T01:51:00.000Z C 25.74610710144043 33.392295837402344
2019-01-04T01:52:00.000Z C 25.978872299194336 33.29177474975586
2019-01-04T01:53:00.000Z B 26.12158203125 33.2805061340332
2019-01-04T01:54:00.000Z B 26.28511619567871 33.26923751831055
2019-01-04T01:55:00.000Z C 26.470335006713867 33.25796890258789
2019-01-04T01:56:00.000Z C 26.63957977294922 33.24669647216797
2019-01-04T01:57:00.000Z C 26.954004287719727 33.23542785644531
2019-01-04T01:58:00.000Z C 27.08258056640625 33.224159240722656
2019-01-04T01:59:00.000Z A 27.25551986694336 33.212890625
2019-01-04T02:00:00.000Z A 27.514263153076172 33.201622009277344
2019-01-04T02:01:00.000Z A 27.588970184326172 33.17148971557617
2019-01-04T02:02:00.000Z B 27.727638244628906 33.13819122314453
2019-01-04T02:03:00.000Z B 27.956039428710938 33.104896545410156
2019-01-04T02:04:00.000Z B 28.152463912963867 33.10499954223633
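I want to get the first and last value of "ts" for each partition value in column "c1".
I tried the following query, but it does not return the correct result.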
SELECT ts, c1, c2, c3,
       first_value(ts) OVER (partition by c1 order by ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first,
       last_value(ts) OVER (partition by c1 order by ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as last
FROM `default`.`a07_a15`
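Problem: the first value comes back as only three distinct ts values, and the last value is completely wrong.
Expected: I need the first and last value for each consecutive run of the same partition value.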
ts c1 c2 c3 first last
2019-01-04T01:50:00.000Z C 25.48801612854004 33.317527770996094 2019-01-04T01:50:00.000Z 2019-01-04T01:52:00.000Z
2019-01-04T01:51:00.000Z C 25.74610710144043 33.392295837402344 2019-01-04T01:50:00.000Z 2019-01-04T01:52:00.000Z
2019-01-04T01:52:00.000Z C 25.978872299194336 33.29177474975586 2019-01-04T01:50:00.000Z 2019-01-04T01:52:00.000Z
2019-01-04T01:53:00.000Z B 26.12158203125 33.2805061340332 2019-01-04T01:53:00.000Z 2019-01-04T01:54:00.000Z
2019-01-04T01:54:00.000Z B 26.28511619567871 33.26923751831055 2019-01-04T01:53:00.000Z 2019-01-04T01:54:00.000Z
2019-01-04T01:55:00.000Z C 26.470335006713867 33.25796890258789 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:56:00.000Z C 26.63957977294922 33.24669647216797 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:57:00.000Z C 26.954004287719727 33.23542785644531 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:58:00.000Z C 27.08258056640625 33.224159240722656 2019-01-04T01:55:00.000Z 2019-01-04T01:58:00.000Z
2019-01-04T01:59:00.000Z A 27.25551986694336 33.212890625 2019-01-04T01:59:00.000Z 2019-01-04T02:01:00.000Z
2019-01-04T02:00:00.000Z A 27.514263153076172 33.201622009277344 2019-01-04T01:59:00.000Z 2019-01-04T02:01:00.000Z
2019-01-04T02:01:00.000Z A 27.588970184326172 33.17148971557617 2019-01-04T01:59:00.000Z 2019-01-04T02:01:00.000Z
2019-01-04T02:02:00.000Z B 27.727638244628906 33.13819122314453 2019-01-04T02:02:00.000Z 2019-01-04T02:04:00.000Z
2019-01-04T02:03:00.000Z B 27.956039428710938 33.104896545410156 2019-01-04T02:02:00.000Z 2019-01-04T02:04:00.000Z
2019-01-04T02:04:00.000Z B 28.152463912963867 33.10499954223633 2019-01-04T02:02:00.000Z 2019-01-04T02:04:00.000Z
Use lag() and lead():
select t.*
from (select t.*,
             lag(c1) over (order by ts) as prev_c1,
             lead(c1) over (order by ts) as next_c1
      from t
     ) t
where prev_c1 is null or next_c1 is null or
      prev_c1 <> c1 or next_c1 <> c1;
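This puts the values in separate rows. If you want them in the same row, treating this as a gaps-and-islands problem is probably the simplest solution: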
select c1, min(ts), max(ts)
from (select t.*,
             row_number() over (order by ts) as seqnum,
             row_number() over (partition by c1 order by ts) as seqnum_2
      from t
     ) t
group by c1, (seqnum - seqnum_2);
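To see why grouping on the difference of the two row numbers works, here is a minimal sketch that runs the same query against the sample data using Python's built-in sqlite3 module. The in-memory table `t`, the reduced two-column schema, the column aliases `first_ts` and `last_ts`, and the use of SQLite (3.25 or newer, for window functions) are assumptions made for illustration; the question itself appears to target a different engine.

```python
import sqlite3
from datetime import datetime, timedelta

# Rebuild the (ts, c1) pairs from the question: 15 one-minute steps
# starting at 01:50, labelled C,C,C,B,B,C,C,C,C,A,A,A,B,B,B.
start = datetime(2019, 1, 4, 1, 50)
sample = [
    ((start + timedelta(minutes=i)).strftime("%Y-%m-%dT%H:%M:00.000Z"), label)
    for i, label in enumerate("CCCBBCCCCAAABBB")
]

conn = sqlite3.connect(":memory:")
# Timestamps are stored as ISO-8601 text, which sorts chronologically.
conn.execute("CREATE TABLE t (ts TEXT, c1 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", sample)

# Step 1: both row numbers side by side. Within one consecutive run of
# a c1 value they advance together, so their difference stays constant.
numbered = conn.execute("""
    SELECT ts, c1,
           row_number() OVER (ORDER BY ts) AS seqnum,
           row_number() OVER (PARTITION BY c1 ORDER BY ts) AS seqnum_2
    FROM t
    ORDER BY ts
""").fetchall()
for ts, c1, seqnum, seqnum_2 in numbered:
    print(ts, c1, seqnum, seqnum_2, seqnum - seqnum_2)

# Step 2: group on (c1, difference) to collapse each run into one row.
islands = conn.execute("""
    SELECT c1, min(ts) AS first_ts, max(ts) AS last_ts
    FROM (SELECT t.*,
                 row_number() OVER (ORDER BY ts) AS seqnum,
                 row_number() OVER (PARTITION BY c1 ORDER BY ts) AS seqnum_2
          FROM t) AS n
    GROUP BY c1, (seqnum - seqnum_2)
    ORDER BY first_ts
""").fetchall()
for row in islands:
    print(row)
```

The difference alone is not unique across different c1 values in general, which is why c1 stays in the GROUP BY alongside it.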
EDIT:
If you need to keep the original rows, just use window functions, making sure the aliases match:
select t.*,
       min(ts) over (partition by c1, (seqnum - seqnum2)) as min_ts,
       max(ts) over (partition by c1, (seqnum - seqnum2)) as max_ts
from (select t.*,
             row_number() over (order by ts) as seqnum,
             row_number() over (partition by c1 order by ts) as seqnum2
      from t
     ) t
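As a rough check that the windowed form keeps all rows and attaches the per-run first and last timestamps, the following sketch runs it through Python's sqlite3 module. The in-memory table `t`, the two-column schema, the subquery alias `n`, and the choice of SQLite (3.25 or newer) are illustrative assumptions, not part of the original setup.

```python
import sqlite3
from datetime import datetime, timedelta

# Same sample as in the question: one row per minute from 01:50,
# with c1 following the pattern C,C,C,B,B,C,C,C,C,A,A,A,B,B,B.
start = datetime(2019, 1, 4, 1, 50)
sample = [
    ((start + timedelta(minutes=i)).strftime("%Y-%m-%dT%H:%M:00.000Z"), label)
    for i, label in enumerate("CCCBBCCCCAAABBB")
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts TEXT, c1 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", sample)

# Window aggregates partitioned by (c1, seqnum - seqnum2): every row
# keeps its own ts and also receives the min and max ts of its run.
rows = conn.execute("""
    SELECT ts, c1,
           min(ts) OVER (PARTITION BY c1, (seqnum - seqnum2)) AS min_ts,
           max(ts) OVER (PARTITION BY c1, (seqnum - seqnum2)) AS max_ts
    FROM (SELECT t.*,
                 row_number() OVER (ORDER BY ts) AS seqnum,
                 row_number() OVER (PARTITION BY c1 ORDER BY ts) AS seqnum2
          FROM t) AS n
    ORDER BY ts
""").fetchall()

for row in rows:
    print(row)
```

Because no ORDER BY appears inside the min/max windows, the frame covers the whole partition, which avoids the running-frame behaviour that made last_value return the current row in the original attempt.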