COUNT() OVER conditioned on the CURRENT ROW

Question

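Given rows that each represent a task, with a start time and an end time, how can I calculate the number of running tasks (i.e. started and not yet ended) at the moment each task starts (counting the task itself), using a window function with COUNT OVER? Is a window function even the right approach?

Example, given the table tasks: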

task_id  start_time  end_time
   a         1          10
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

Compute running_tasks:

task_id  start_time  end_time  running_tasks
   a         1          10           1         # a
   b         2           5           2         # a,b
   c         5          15           2         # a,c (b has ended)
   d         8          13           3         # a,c,d
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)

Answer 1

select      task_id,start_time,end_time,running_tasks 

from       (select      task_id,tm,op,start_time,end_time

                       ,sum(op) over 
                        (
                            order by    tm,op 
                            rows        unbounded preceding
                        ) as running_tasks 

            from       (select      task_id,start_time as tm,1 as op,start_time,end_time 
                        from        tasks 

                        union   all 

                        select      task_id,end_time as tm,-1 as op,start_time,end_time 
                        from        tasks 
                        ) t 
            )t 

where       op = 1
;
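
The query above comes without explanation, so here is a rough Python sketch of what it appears to do: turn every task into a +1 event at its start and a -1 event at its end, sort the events by (tm, op), keep a running sum, and report the sum on the start events only. The function name running_at_start is made up for illustration, and the loop accumulates one event at a time, which mirrors the ROWS frame used in the query.

```python
# Task rows from the question: (task_id, start_time, end_time).
tasks = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]

def running_at_start(tasks):
    """Hypothetical helper: running-task count at each task's start."""
    # One +1 event per start and one -1 event per end, like the UNION ALL.
    events = []
    for task_id, start, end in tasks:
        events.append((start, 1, task_id))
        events.append((end, -1, task_id))
    # ORDER BY tm, op: at equal times, ends (-1) sort before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    running = 0
    result = {}
    for tm, op, task_id in events:
        running += op              # running sum, one row at a time
        if op == 1:                # WHERE op = 1: keep start events only
            result[task_id] = running
    return result

counts = running_at_start(tasks)
print(counts)  # {'a': 1, 'b': 2, 'c': 2, 'd': 3, 'e': 3, 'f': 1}
```

Sorting on op as the second key is what makes a task that ends at time 5 drop out before a task that starts at time 5 is counted.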

Answer 2

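You can use a correlated subquery, which in this case amounts to a self-join; no analytic function is needed. After enabling standard SQL (uncheck "Use Legacy SQL" under "Show Options" in the UI), you can run this example: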

WITH tasks AS (
  SELECT
    task_id,
    start_time,
    end_time
  FROM UNNEST(ARRAY<STRUCT<task_id STRING, start_time INT64, end_time INT64>>[
    ('a', 1, 10),
    ('b', 2, 5),
    ('c', 5, 15),
    ('d', 8, 13),
    ('e', 12, 20),
    ('f', 21, 30)
  ])
)
SELECT
  *,
  (SELECT COUNT(*) FROM tasks t2
   WHERE t.start_time >= t2.start_time AND
   t.start_time < t2.end_time) AS running_tasks
FROM tasks t
ORDER BY task_id;
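
To make the subquery's condition concrete, here is a small Python sketch of the same comparison: for each task t, count the tasks t2 with t2.start_time <= t.start_time < t2.end_time. The function name running_tasks_by_scan is hypothetical, and the nested scan is quadratic in the number of tasks, so treat it as an illustration of the logic rather than as a statement about how the query is executed.

```python
# Task rows from the question: (task_id, start_time, end_time).
tasks = [
    ("a", 1, 10), ("b", 2, 5), ("c", 5, 15),
    ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]

def running_tasks_by_scan(tasks):
    """Hypothetical helper mirroring the correlated subquery."""
    result = {}
    for task_id, start, _end in tasks:
        # A task t2 is running at `start` if it has started (s <= start)
        # and has not yet ended (start < e): a half-open interval.
        result[task_id] = sum(1 for _id, s, e in tasks if s <= start < e)
    return result

scan_counts = running_tasks_by_scan(tasks)
print(scan_counts)  # {'a': 1, 'b': 2, 'c': 2, 'd': 3, 'e': 3, 'f': 1}
```

Because every task is compared against the full table, tasks that share a start_time all see each other and receive the same count.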

Answer 3

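As Elliott mentioned, "it's generally more difficult to explain analytic functions to new users", and even experienced users do not always get them 100% right (though very, very close)!
So, while Dudu Markovitz's answer is good, it is unfortunately still incorrect (at least according to my understanding of the question). It fails when several tasks start at the same start_time: those tasks get the wrong "running tasks" result.

As an example, consider the following data: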

task_id  start_time  end_time
   a         1          10
   aa        1           2
   aaa       1           8
   b         2           5
   c         5          15
   d         8          13
   e        12          20
   f        21          30

I think you would expect the following result:

task_id  start_time  end_time  running_tasks
   a         1          10           3         # a,aa,aaa
   aa        1           2           3         # a,aa,aaa
   aaa       1           8           3         # a,aa,aaa
   b         2           5           3         # a,aaa,b (aa has ended)
   c         5          15           3         # a,aaa,c (b has ended)
   d         8          13           3         # a,c,d (aaa has ended)
   e        12          20           3         # c,d,e (a has ended)
   f        21          30           1         # f (c,d,e have ended)

If you try Dudu's code, you will get the result below:

task_id  start_time  end_time  running_tasks
   a         1          10           1        
   aa        1           2           2        
   aaa       1           8           3        
   b         2           5           3        
   c         5          15           3        
   d         8          13           3        
   e        12          20           3        
   f        21          30           1

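As you can see, the results for tasks a and aa are wrong.
The reason is the use of ROWS UNBOUNDED PRECEDING instead of RANGE UNBOUNDED PRECEDING: a subtle but very important difference!

So the query below gives you the correct result: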

SELECT  task_id,start_time,end_time,running_tasks 
FROM  (
  SELECT  
    task_id, tm, op, start_time, end_time,
    SUM(op) OVER (ORDER BY  tm ,op RANGE UNBOUNDED PRECEDING) AS running_tasks 
  FROM  (
    SELECT  
      task_id, start_time AS tm, 1 AS op, start_time, end_time 
    FROM  tasks UNION  ALL 
    SELECT  
      task_id, end_time AS tm, -1 AS op, start_time, end_time 
    FROM  tasks 
  ) t 
)t 
WHERE  op = 1
ORDER BY start_time

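Quick summary:
ROWS UNBOUNDED PRECEDING sets the window frame based on row position: the frame ends at the current row, so rows that tie on the ORDER BY values are accumulated one at a time, in an order that is not guaranteed.
RANGE UNBOUNDED PRECEDING sets the window frame based on row values: the frame also includes every row that ties with the current row on the ORDER BY values.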
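
The difference can be checked outside BigQuery. The sketch below runs the same event query twice against an in-memory SQLite database through Python's sqlite3 module, once with ROWS and once with RANGE. SQLite is only a stand-in here, and the sketch assumes a SQLite build with window-function support (3.25 or later); the helper name running_tasks and the subquery aliases are made up for illustration.

```python
import sqlite3

# The example data with three tasks starting at the same time.
TASKS = [
    ("a", 1, 10), ("aa", 1, 2), ("aaa", 1, 8), ("b", 2, 5),
    ("c", 5, 15), ("d", 8, 13), ("e", 12, 20), ("f", 21, 30),
]

# {frame} is filled in with ROWS or RANGE; the rest matches the query above.
QUERY = """
SELECT task_id, running_tasks
FROM (
  SELECT task_id, op,
         SUM(op) OVER (ORDER BY tm, op {frame} UNBOUNDED PRECEDING) AS running_tasks
  FROM (
    SELECT task_id, start_time AS tm,  1 AS op FROM tasks
    UNION ALL
    SELECT task_id, end_time   AS tm, -1 AS op FROM tasks
  ) AS events
) AS summed
WHERE op = 1
"""

def running_tasks(frame):
    """Hypothetical helper: run the query with the given frame unit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (task_id TEXT, start_time INTEGER, end_time INTEGER)"
    )
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", TASKS)
    rows = conn.execute(QUERY.format(frame=frame)).fetchall()
    conn.close()
    return dict(rows)

by_rows = running_tasks("ROWS")
by_range = running_tasks("RANGE")
# With ROWS, a/aa/aaa receive 1, 2 and 3 in some unspecified order.
print("ROWS :", by_rows)
# With RANGE, a/aa/aaa are peers and all receive 3.
print("RANGE:", by_range)
```

Which of the tied tasks gets 1, 2 or 3 under ROWS depends on how the engine happens to order the ties, so only the RANGE result is stable.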

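Again, as Elliott mentioned, fully understanding this is considerably more complicated than the JOIN concept, but it is worth it (because it is more efficient than a join). See more about the Window Frame Clause and ROWS versus RANGE usage.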

Tags: sql, window-functions, google-bigquery