Partitioning rows into groups by cumulative time interval

The search session log I was given looks like this:

+----------+-------------------------+----------+
|    dt    |       search_time       | searches |
+----------+-------------------------+----------+
| 20200601 | 2020-06-01 00:36:38.000 |        1 |
| 20200601 | 2020-06-01 00:37:38.000 |        1 |
| 20200601 | 2020-06-01 00:39:18.000 |        1 |
| 20200601 | 2020-06-01 01:16:18.000 |        1 |
| 20200601 | 2020-06-01 03:56:38.000 |        1 |
| 20200601 | 2020-06-01 05:36:38.000 |        1 |
| 20200601 | 2020-06-01 05:37:38.000 |        1 |
| 20200601 | 2020-06-01 05:39:38.000 |        1 |
| 20200601 | 2020-06-01 05:41:38.000 |        1 |
| 20200601 | 2020-06-01 07:26:38.000 |        1 |
+----------+-------------------------+----------+

My task is to partition the rows into session groups, where a session group spans at most five minutes.

For example:

The first 3 rows form session group 1 - if we add up the minutes between each row we get 3 minutes, while the 4th row would push the accumulated time past 5 minutes, so it falls into a different session group.

+----------+-------------------------+----------+---------------+
|    dt    |       search_time       | searches | group_session |
+----------+-------------------------+----------+---------------+
| 20200601 | 2020-06-01 00:36:38.000 |        1 |             1 |
| 20200601 | 2020-06-01 00:37:38.000 |        1 |             1 |
| 20200601 | 2020-06-01 00:39:18.000 |        1 |             1 |
| 20200601 | 2020-06-01 01:16:18.000 |        1 |             2 |
+----------+-------------------------+----------+---------------+

To prepare the table for partitioning, I manipulated it like this:

WITH [Sub Table] AS
(

SELECT   [dt]
        ,[search_time]
        ,[previous search time] = LAG(search_time) OVER (ORDER BY search_time)
        ,[min diff] = ISNULL(DATEDIFF(MINUTE,LAG(search_time) OVER (ORDER BY search_time),search_time),0)
        ,[searches]
FROM     [search_session]
)
SELECT
    [dt],
    [search_time],
    [previous search time],
    [min diff],
    [searches]
FROM [Sub Table]

and got this:

+----------+-------------------------+-------------------------+----------+----------+
|    dt    |       search_time       |  previous search time   | min diff | searches |
+----------+-------------------------+-------------------------+----------+----------+
| 20200601 | 2020-06-01 00:36:38.000 | NULL                    |        0 |        1 |
| 20200601 | 2020-06-01 00:37:38.000 | 2020-06-01 00:36:38.000 |        1 |        1 |
| 20200601 | 2020-06-01 00:39:18.000 | 2020-06-01 00:37:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 01:16:18.000 | 2020-06-01 00:39:18.000 |       37 |        1 |
| 20200601 | 2020-06-01 03:56:38.000 | 2020-06-01 01:16:18.000 |      160 |        1 |
| 20200601 | 2020-06-01 05:36:38.000 | 2020-06-01 03:56:38.000 |      100 |        1 |
| 20200601 | 2020-06-01 05:37:38.000 | 2020-06-01 05:36:38.000 |        1 |        1 |
| 20200601 | 2020-06-01 05:39:38.000 | 2020-06-01 05:37:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 05:41:38.000 | 2020-06-01 05:39:38.000 |        2 |        1 |
| 20200601 | 2020-06-01 07:26:38.000 | 2020-06-01 05:41:38.000 |      105 |        1 |
+----------+-------------------------+-------------------------+----------+----------+
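One thing worth noting about [min diff]: DATEDIFF(MINUTE, ...) counts crossed minute boundaries rather than elapsed 60-second intervals, which is why the 100-second gap between 00:37:38 and 00:39:18 shows up as 2. A quick check of the difference, using DATEDIFF(SECOND, ...) for the exact elapsed time:

SELECT DATEDIFF(MINUTE, '2020-06-01 00:37:38', '2020-06-01 00:39:18')        AS boundary_minutes, -- 2: crosses the 00:38 and 00:39 boundaries
       DATEDIFF(SECOND, '2020-06-01 00:37:38', '2020-06-01 00:39:18') / 60.0 AS elapsed_minutes   -- 1.666666 (100 seconds)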

I have thought of two possible ways to continue:

  1. Using a window function such as RANK(): I could partition the rows, but I can't figure out how to set up the PARTITION BY condition to do this.

  2. Iterating over the table with a WHILE loop - again, I found it hard to formulate this.

This cannot be done with window functions alone. You need some kind of iterative process that keeps track of the first row of each group and identifies the next one on the fly.
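For contrast, here is a sketch of the usual gaps-and-islands pattern built from window functions alone (assuming your [search_session] table and the 5-minute threshold). It compares each row only to the immediately preceding one, so a long run of closely spaced searches is never split even once its total span exceeds 5 minutes - which is why it does not meet your requirement:

select x.*,
       1 + sum(case when datediff(minute, x.prev_search_time, x.search_time) > 5
                    then 1 else 0 end)
           over (order by x.search_time rows unbounded preceding) as naive_grp
from (
    -- gap to the immediately preceding search, not to the first search of the group
    select s.*, lag(search_time) over (order by search_time) as prev_search_time
    from search_session s
) x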

In SQL, you can express this with a recursive query:

with 
    -- number the rows in chronological order so the recursion can walk them one by one
    data as (select t.*, row_number() over(order by search_time) rn from mytable t),
    cte as (
        -- anchor: the earliest search opens the first group
        select d.*, search_time as first_search_time 
        from data d
        where rn = 1
        union all 
        -- recursive step: carry the group's first search time forward,
        -- or restart it when the current row is more than 5 minutes after it
        select d.*, 
            case when d.search_time > dateadd(minute, 5, c.first_search_time)
                then d.search_time
                else c.first_search_time
            end
        from cte c 
        inner join data d on d.rn = c.rn + 1
    )
-- rows sharing the same first_search_time belong to the same group
select c.*, dense_rank() over(order by first_search_time) grp 
from cte c

For your sample data, this returns:

dt         | search_time             | searches | rn | first_search_time       | grp
:--------- | :---------------------- | -------: | -: | :---------------------- | --:
2020-06-01 | 2020-06-01 00:36:38.000 |        1 |  1 | 2020-06-01 00:36:38.000 |   1
2020-06-01 | 2020-06-01 00:37:38.000 |        1 |  2 | 2020-06-01 00:36:38.000 |   1
2020-06-01 | 2020-06-01 00:39:18.000 |        1 |  3 | 2020-06-01 00:36:38.000 |   1
2020-06-01 | 2020-06-01 01:16:18.000 |        1 |  4 | 2020-06-01 01:16:18.000 |   2
2020-06-01 | 2020-06-01 03:56:38.000 |        1 |  5 | 2020-06-01 03:56:38.000 |   3
2020-06-01 | 2020-06-01 05:36:38.000 |        1 |  6 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 05:37:38.000 |        1 |  7 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 05:39:38.000 |        1 |  8 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 05:41:38.000 |        1 |  9 | 2020-06-01 05:36:38.000 |   4
2020-06-01 | 2020-06-01 07:26:38.000 |        1 | 10 | 2020-06-01 07:26:38.000 |   5
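One practical note if you run this against your full [search_session] table (a sketch, assuming the same column names as above): the recursion advances one level per row, and SQL Server caps recursive CTEs at 100 levels by default, so for longer logs you will likely need to lift the limit with OPTION (MAXRECURSION 0):

with 
    data as (select s.*, row_number() over(order by search_time) rn from search_session s),
    cte as (
        select d.*, search_time as first_search_time 
        from data d
        where rn = 1
        union all 
        select d.*, 
            case when d.search_time > dateadd(minute, 5, c.first_search_time)
                then d.search_time
                else c.first_search_time
            end
        from cte c 
        inner join data d on d.rn = c.rn + 1
    )
select c.*, dense_rank() over(order by first_search_time) as group_session
from cte c
option (maxrecursion 0);   -- lift the default 100-level recursion cap for larger inputs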