COUNT() OVER conditioned on the CURRENT ROW
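Given rows that each represent a task, with a start time and an end time, how can I calculate the number of running tasks (i.e. started and not yet ended) at the moment each task starts, including the task itself, using a window function with COUNT OVER? Is a window function even the right approach?
Example, given the table tasks: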
task_id start_time end_time
a 1 10
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
Calculate running_tasks:
task_id start_time end_time running_tasks
a 1 10 1 # a
b 2 5 2 # a,b
c 5 15 2 # a,c (b has ended)
d 8 13 3 # a,c,d
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
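select task_id, start_time, end_time, running_tasks
from (select task_id, tm, op, start_time, end_time
           -- running total of +1 (start) and -1 (end) events in time order
           , sum(op) over
             (
               order by tm, op
               rows unbounded preceding
             ) as running_tasks
      -- one +1 event per task start and one -1 event per task end
      from (select task_id, start_time as tm, 1 as op, start_time, end_time
            from tasks
            union all
            select task_id, end_time as tm, -1 as op, start_time, end_time
            from tasks
           ) t
     ) t
where op = 1  -- keep only the start events, one row per task
;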
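For anyone who finds the query above hard to read, here is a rough Python sketch of the same idea, not a translation the answer itself provides. It assumes tasks are plain (task_id, start_time, end_time) tuples, and the name running_tasks_sweep is made up for illustration.

```python
def running_tasks_sweep(tasks):
    """Count running tasks at each task's start via a +1/-1 event sweep.

    tasks: list of (task_id, start_time, end_time) tuples (assumed shape).
    """
    events = []
    for task_id, start_time, end_time in tasks:
        events.append((start_time, 1, task_id))   # task starts: +1
        events.append((end_time, -1, task_id))    # task ends:   -1
    # Same ordering as ORDER BY tm, op: at equal times, ends (-1) come first.
    events.sort(key=lambda ev: (ev[0], ev[1]))

    running = 0
    result = {}
    for tm, op, task_id in events:
        running += op              # row-by-row running total
        if op == 1:                # WHERE op = 1: report start events only
            result[task_id] = running
    return result


tasks = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]
counts = running_tasks_sweep(tasks)
print(counts)
```

On the sample data from the question this gives 1, 2, 2, 3, 3, 1 for tasks a through f. Sorting ends before starts at the same time is what makes c count only a and c, since b ends at 5.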
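You can use a correlated subquery, which in this case amounts to a self-join; no analytic function is needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI), you can run this example: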
WITH tasks AS (
  SELECT
    task_id,
    start_time,
    end_time
  FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
    ('a', 1, 10),
    ('b', 2, 5),
    ('c', 5, 15),
    ('d', 8, 13),
    ('e', 12, 20),
    ('f', 21, 30)
  ])
)
SELECT
  *,
  (SELECT COUNT(*) FROM tasks t2
   WHERE t.start_time >= t2.start_time AND
         t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;
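To make the condition in the subquery concrete, here is a small Python sketch of the same pairwise check. It is only an illustration under the assumption that tasks are (task_id, start_time, end_time) tuples; running_tasks_bruteforce is a hypothetical name, not something from BigQuery.

```python
def running_tasks_bruteforce(tasks):
    """For each task, count tasks t2 with t2.start <= start < t2.end."""
    result = {}
    for task_id, start_time, _ in tasks:
        result[task_id] = sum(
            1
            for _, other_start, other_end in tasks
            # Same predicate as the WHERE clause of the correlated subquery.
            if other_start <= start_time < other_end
        )
    return result


tasks = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]
bruteforce_counts = running_tasks_bruteforce(tasks)
print(bruteforce_counts)
```

This compares every task with every other task, so the work grows quadratically with the number of tasks. The half-open interval (start inclusive, end exclusive) is what treats b as already ended when c starts at 5.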
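As Elliott mentioned, "it's generally more difficult to explain analytic functions to new users", and even experienced users don't always get them 100% right (though very, very close)!
So, while Dudu Markovitz's answer is a good one, it is unfortunately still incorrect (at least per my understanding of the question). It fails when multiple tasks start at the same start_time: those tasks get the wrong "running tasks" result.
As an example, consider the data below: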
task_id start_time end_time
a 1 10
aa 1 2
aaa 1 8
b 2 5
c 5 15
d 8 13
e 12 20
f 21 30
I think you would expect the following result:
task_id start_time end_time running_tasks
a 1 10 3 # a,aa,aaa
aa 1 2 3 # a,aa,aaa
aaa 1 8 3 # a,aa,aaa
b 2 5 3 # a,aaa,b (aa has ended)
c 5 15 3 # a,aaa,c (b has ended)
d 8 13 3 # a,c,d (aaa has ended)
e 12 20 3 # c,d,e (a has ended)
f 21 30 1 # f (c,d,e have ended)
If you try Dudu's code, you get the result below instead:
task_id start_time end_time running_tasks
a 1 10 1
aa 1 2 2
aaa 1 8 3
b 2 5 3
c 5 15 3
d 8 13 3
e 12 20 3
f 21 30 1
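As you can see, the results for tasks a and aa are wrong. (Rows that tie on tm and op have no guaranteed order, so which of the tied tasks get the wrong value can differ from run to run.)
The reason is the use of ROWS UNBOUNDED PRECEDING instead of RANGE UNBOUNDED PRECEDING, a small but very important nuance!
So the query below gives you the correct result: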
SELECT task_id, start_time, end_time, running_tasks
FROM (
  SELECT
    task_id, tm, op, start_time, end_time,
    SUM(op) OVER (ORDER BY tm, op RANGE UNBOUNDED PRECEDING) AS running_tasks
  FROM (
    SELECT
      task_id, start_time AS tm, 1 AS op, start_time, end_time
    FROM tasks
    UNION ALL
    SELECT
      task_id, end_time AS tm, -1 AS op, start_time, end_time
    FROM tasks
  ) t
) t
WHERE op = 1
ORDER BY start_time
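Quick summary:
ROWS UNBOUNDED PRECEDING sets the window frame based on row position: the frame ends at the current row itself.
whereas
RANGE UNBOUNDED PRECEDING sets the window frame based on row values: the frame also includes every peer row, i.e. every row with the same ORDER BY values as the current row.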
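The difference can be mimicked in a few lines of Python. This is just a sketch of the two frame behaviours under the assumption that tasks are (task_id, start_time, end_time) tuples; the function sweep and its frame argument are invented for illustration and do not correspond to any BigQuery API.

```python
from itertools import groupby


def sweep(tasks, frame):
    """Running SUM(op) over events ordered by (tm, op).

    frame="rows":  the total grows one row at a time.
    frame="range": all peers (equal tm and op) are added before any is reported.
    """
    events = sorted(
        [(start, 1, tid) for tid, start, _ in tasks]
        + [(end, -1, tid) for tid, _, end in tasks],
        key=lambda ev: (ev[0], ev[1]),
    )
    running = 0
    result = {}
    # groupby yields consecutive events that share the same (tm, op) key.
    for _, group in groupby(events, key=lambda ev: (ev[0], ev[1])):
        peers = list(group)
        if frame == "range":
            running += sum(op for _, op, _ in peers)
            for _, op, tid in peers:
                if op == 1:
                    result[tid] = running
        else:
            for _, op, tid in peers:
                running += op
                if op == 1:
                    result[tid] = running
    return result


tied_tasks = [
    ("a", 1, 10), ("aa", 1, 2), ("aaa", 1, 8), ("b", 2, 5),
    ("c", 5, 15), ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]
rows_counts = sweep(tied_tasks, "rows")
range_counts = sweep(tied_tasks, "range")
print(rows_counts)
print(range_counts)
```

With frame="rows" the three tasks starting at time 1 get 1, 2 and 3 (Python's sort is stable, so they keep their input order here, whereas SQL makes no such promise). With frame="range" all three get 3, which matches the expected result.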
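Again, as Elliott mentioned, fully understanding this is much more complicated than the JOIN concept, but it is worth it (because it is more efficient than a join). See more about the Window Frame Clause and ROWS vs RANGE usage.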