Partitioning rows into groups by accumulative time interval
The search-session log I was given looks like this:
+----------+-------------------------+----------+
| dt | search_time | searches |
+----------+-------------------------+----------+
| 20200601 | 2020-06-01 00:36:38.000 | 1 |
| 20200601 | 2020-06-01 00:37:38.000 | 1 |
| 20200601 | 2020-06-01 00:39:18.000 | 1 |
| 20200601 | 2020-06-01 01:16:18.000 | 1 |
| 20200601 | 2020-06-01 03:56:38.000 | 1 |
| 20200601 | 2020-06-01 05:36:38.000 | 1 |
| 20200601 | 2020-06-01 05:37:38.000 | 1 |
| 20200601 | 2020-06-01 05:39:38.000 | 1 |
| 20200601 | 2020-06-01 05:41:38.000 | 1 |
| 20200601 | 2020-06-01 07:26:38.000 | 1 |
+----------+-------------------------+----------+
My task is to partition the rows into session groups. A session group lasts at most five minutes.
For example:
The top 3 rows form session group 1: if we accumulate the minutes between consecutive rows we get 3 minutes, while the 4th row would push the total past 5 minutes, so it belongs to a different session group.
+----------+-------------------------+----------+---------------+
| dt | search_time | searches | group_session |
+----------+-------------------------+----------+---------------+
| 20200601 | 2020-06-01 00:36:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 00:37:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 00:39:18.000 | 1 | 1 |
| 20200601 | 2020-06-01 01:16:18.000 | 1 | 2 |
+----------+-------------------------+----------+---------------+
I manipulated the table like this to prepare it for the partitioning:
WITH [Sub Table] AS
(
    SELECT [dt]
          ,[search_time]
          ,[pervious search time] = LAG(search_time) OVER (ORDER BY search_time)
          ,[min diff] = ISNULL(DATEDIFF(MINUTE, LAG(search_time) OVER (ORDER BY search_time), search_time), 0)
          ,[searches]
    FROM [search_session]
)
SELECT
    [dt],
    [search_time],
    [pervious search time],
    [min diff],
    [searches]
FROM [Sub Table]
and got this:
+----------+-------------------------+-------------------------+----------+----------+
| dt | search_time | pervious search time | min diff | searches |
+----------+-------------------------+-------------------------+----------+----------+
| 20200601 | 2020-06-01 00:36:38.000 | NULL | 0 | 1 |
| 20200601 | 2020-06-01 00:37:38.000 | 2020-06-01 00:36:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 00:39:18.000 | 2020-06-01 00:37:38.000 | 2 | 1 |
| 20200601 | 2020-06-01 01:16:18.000 | 2020-06-01 00:39:18.000 | 37 | 1 |
| 20200601 | 2020-06-01 03:56:38.000 | 2020-06-01 01:16:18.000 | 160 | 1 |
| 20200601 | 2020-06-01 05:36:38.000 | 2020-06-01 03:56:38.000 | 100 | 1 |
| 20200601 | 2020-06-01 05:37:38.000 | 2020-06-01 05:36:38.000 | 1 | 1 |
| 20200601 | 2020-06-01 05:39:38.000 | 2020-06-01 05:37:38.000 | 2 | 1 |
| 20200601 | 2020-06-01 05:41:38.000 | 2020-06-01 05:39:38.000 | 2 | 1 |
| 20200601 | 2020-06-01 07:26:38.000 | 2020-06-01 05:41:38.000 | 105 | 1 |
+----------+-------------------------+-------------------------+----------+----------+
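The LAG/DATEDIFF step above can also be sketched outside SQL. A minimal Python equivalent (timestamps hard-coded from the sample data); note that T-SQL's `DATEDIFF(MINUTE, a, b)` counts minute *boundaries* crossed rather than elapsed whole minutes, which is why 00:37:38 → 00:39:18 yields 2 even though only 1:40 elapsed:

```python
from datetime import datetime

times = [
    "2020-06-01 00:36:38", "2020-06-01 00:37:38", "2020-06-01 00:39:18",
    "2020-06-01 01:16:18", "2020-06-01 03:56:38", "2020-06-01 05:36:38",
    "2020-06-01 05:37:38", "2020-06-01 05:39:38", "2020-06-01 05:41:38",
    "2020-06-01 07:26:38",
]
ts = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in times]

def minute_diff(t1, t2):
    # DATEDIFF(MINUTE, t1, t2) counts minute boundaries crossed,
    # so truncate both values to the minute before subtracting.
    trunc = lambda t: t.replace(second=0, microsecond=0)
    return int((trunc(t2) - trunc(t1)).total_seconds()) // 60

rows = []
prev = None
for t in ts:
    # (search_time, previous search time, min diff); the `if prev` branch
    # plays the role of ISNULL(..., 0) on the first row.
    rows.append((t, prev, minute_diff(prev, t) if prev else 0))
    prev = t

print([d for _, _, d in rows])
# → [0, 1, 2, 37, 160, 100, 1, 2, 2, 105]
```

The printed diffs match the `min diff` column in the table above.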
I thought of two possible ways to continue:
Using a window function such as RANK(), I could partition the rows, but I can't figure out how to set up a PARTITION BY condition that does this.
Iterating over the table with a WHILE loop - again, I found it hard to formulate this.
This cannot be done with window functions alone. You need some kind of iterative process that keeps track of the first row of each group and identifies the next one on the fly.
In SQL, you can express this with a recursive query:
with
    data as (select t.*, row_number() over(order by search_time) rn from mytable t),
    cte as (
        select d.*, search_time as first_search_time
        from data d
        where rn = 1
        union all
        select d.*,
            case when d.search_time > dateadd(minute, 5, c.first_search_time)
                then d.search_time
                else c.first_search_time
            end
        from cte c
        inner join data d on d.rn = c.rn + 1
    )
select c.*, dense_rank() over(order by first_search_time) grp
from cte c
For your sample data, this returns:
dt | search_time | searches | rn | first_search_time | grp
:--------- | :---------------------- | -------: | -: | :---------------------- | --:
2020-06-01 | 2020-06-01 00:36:38.000 | 1 | 1 | 2020-06-01 00:36:38.000 | 1
2020-06-01 | 2020-06-01 00:37:38.000 | 1 | 2 | 2020-06-01 00:36:38.000 | 1
2020-06-01 | 2020-06-01 00:39:18.000 | 1 | 3 | 2020-06-01 00:36:38.000 | 1
2020-06-01 | 2020-06-01 01:16:18.000 | 1 | 4 | 2020-06-01 01:16:18.000 | 2
2020-06-01 | 2020-06-01 03:56:38.000 | 1 | 5 | 2020-06-01 03:56:38.000 | 3
2020-06-01 | 2020-06-01 05:36:38.000 | 1 | 6 | 2020-06-01 05:36:38.000 | 4
2020-06-01 | 2020-06-01 05:37:38.000 | 1 | 7 | 2020-06-01 05:36:38.000 | 4
2020-06-01 | 2020-06-01 05:39:38.000 | 1 | 8 | 2020-06-01 05:36:38.000 | 4
2020-06-01 | 2020-06-01 05:41:38.000 | 1 | 9 | 2020-06-01 05:36:38.000 | 4
2020-06-01 | 2020-06-01 07:26:38.000 | 1 | 10 | 2020-06-01 07:26:38.000 | 5
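The recursive CTE walks the rows in `search_time` order and carries `first_search_time` forward, resetting it whenever a row falls more than 5 minutes after the current group's start. A minimal iterative sketch of that same logic in Python (timestamps hard-coded from the sample data):

```python
from datetime import datetime, timedelta

times = [
    "2020-06-01 00:36:38", "2020-06-01 00:37:38", "2020-06-01 00:39:18",
    "2020-06-01 01:16:18", "2020-06-01 03:56:38", "2020-06-01 05:36:38",
    "2020-06-01 05:37:38", "2020-06-01 05:39:38", "2020-06-01 05:41:38",
    "2020-06-01 07:26:38",
]
ts = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in times]

groups = []
grp = 0
first = None  # first_search_time of the current group
for t in ts:
    # Same test as the CASE in the recursive member: start a new group
    # when the row is strictly more than 5 minutes after the group's first row.
    if first is None or t > first + timedelta(minutes=5):
        first = t
        grp += 1  # plays the role of DENSE_RANK() over first_search_time
    groups.append(grp)

print(groups)
# → [1, 1, 1, 2, 3, 4, 4, 4, 4, 5]
```

Note the strict `>`: the row at 05:41:38 sits exactly 5 minutes after its group's start (05:36:38) and therefore stays in group 4, matching the SQL output above.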