Combining consecutive rows based on a "type" column
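I'm looking for ideas and T-SQL solutions for combining consecutive records, as in the example below.
The source database I'm working with holds audit records, with a column called "Audit_Type" that can contain many different values, such as "Saved Form", "Exported Record", "Imported Record", or "Viewed Record". This database ends up with a lot of extraneous "Saved Form" records, because the application that writes to it auto-saves the form fairly regularly while a user is editing it. So there are often many consecutive "Saved Form" records.
For illustration: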
ID Audit Type DateTime
1 "Viewed Record" 2017-01-03 11:16:33.000
2 "Saved Form" 2017-01-04 09:51:36.837
3 "Saved Form" 2017-01-04 09:52:40.837
4 "Saved Form" 2017-01-04 09:52:44.837
5 "Saved Form" 2017-01-04 09:52:49.837
6 "Saved Form" 2017-01-04 09:52:54.837
7 "Saved Form" 2017-01-04 09:54:59.837
8 "Exported Record" 2017-01-04 09:55:59.837
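Question 1. Before loading the data into my destination database, I'd like to collapse each run of consecutive "Saved Form" records into a single record that carries the timestamp of the last "Saved Form" in the run. Like this: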
ID Audit Type DateTime
1 "Viewed Record" 2017-01-03 11:16:33.000
7 "Saved Form" 2017-01-04 09:54:59.837
8 "Exported Record" 2017-01-04 09:55:59.837
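I've tried a few things so far, but I'd like to hear ideas.
Question 2. From my research and from reading similar questions on SO, this looks like a gaps-and-islands problem. Is that an accurate description of this problem?
Edit
This is for SQL Server 2012.
I'm pulling from a database where I have no control over how it logs information.
Also, to clarify: there are other columns in this log table that I left out for brevity, so in the example above we can assume all the "Saved Form" records come from the same session and the same user.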
For your sample data, you can simply do:
select type, max(id) as id, max(datetime) as datetime
from t
group by type;
You only need a gaps-and-islands solution if you have interleaved types, that is, the same type appearing in two different groups.
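For the interleaved case, one common gaps-and-islands pattern is the difference of two row numbers. The following is only a sketch under assumptions: it reuses the table t(id, type, datetime) from the query above, assumes datetime values are unique enough to give a stable order, and assumes id increases with datetime so that max(id) picks the last row of each run. The aliases x and grp are made up for illustration.

```sql
-- Rows in the same consecutive run of a type share one (type, grp) pair:
-- both row numbers advance by 1 per row while the type stays the same,
-- so their difference is constant within a run and changes between runs.
select max(id) as id, type, max(datetime) as datetime
from (select t.*,
             row_number() over (order by datetime)
             - row_number() over (partition by type order by datetime) as grp
      from t
     ) x
group by type, grp
order by max(datetime);
```

Unlike the plain group by type, this keeps one row per run, so a type that shows up again later produces a second row instead of being merged into the first.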
In your case, you are just looking for the last record before a change. You can use lead():
select t.*
from (select t.*,
lead(type) over (order by datetime) as next_type
from t
) t
where next_type is null or next_type <> type;
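This is simpler than most gaps-and-islands problems.
To handle session/user, you can include the appropriate columns in the group by or in the partition by clause for lead().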
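As a rough sketch of the partitioned form, assuming the omitted columns are named sessionid and userid (hypothetical names, since the question does not show them):

```sql
-- sessionid and userid are assumed column names, not taken from the question.
-- Partitioning restarts the comparison for each session/user, so the last row
-- of one session is never compared with the first row of another.
select x.*
from (select t.*,
             lead(type) over (partition by sessionid, userid
                              order by datetime) as next_type
      from t
     ) x
where next_type is null or next_type <> type;
```

The last row of every partition has a null next_type, so it is always kept.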
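Gordon beat me to the answer. Yes, this does fit the gaps-and-islands approach. I think LEAD() is a good fit for this problem. But I also tried a second query using ROW_NUMBER(), which produced a slightly shorter execution plan. I'm not sure which would be more efficient at scale. That would need more testing.
Note 1: I also added a hypothetical SessionID and UserID to the queries. Extra columns may change your final results.
Note 2: SQL Fiddle reported that the ROW_NUMBER version ran faster when there were fewer "same time" entries, but the LEAD version was faster when there were many "same time" entries.
MS SQL Server 2014 Schema Setup: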
CREATE TABLE foo ( ID int IDENTITY, sessionID int, userid int, AuditType varchar(50), [DateTime] datetime ) ;
INSERT INTO foo ( sessionID, userID, AuditType, [DateTime] )
VALUES
(1,1,'Viewed Record','2017-01-03 11:16:33.000')
, (1,1,'Saved Form','2017-01-04 09:51:36.837')
, (2,2,'Viewed Record','2017-01-04 09:52:00.000')
, (1,1,'Saved Form','2017-01-04 09:52:40.837')
, (1,1,'Saved Form','2017-01-04 09:52:44.837')
, (2,2,'Saved Form','2017-01-04 09:52:45.000')
, (2,2,'Saved Form','2017-01-04 09:52:46.000')
, (2,2,'Saved Form','2017-01-04 09:52:47.000')
, (2,2,'Saved Form','2017-01-04 09:52:48.000')
, (1,1,'Saved Form','2017-01-04 09:52:49.837')
, (1,1,'Saved Form','2017-01-04 09:52:54.837')
, (2,2,'Exported Record','2017-01-04 09:53:00.000')
, (1,1,'Saved Form','2017-01-04 09:54:59.837')
, (1,1,'Exported Record','2017-01-04 09:55:59.837')
, (2,1,'Viewed Record','2017-01-04 10:00:00.000')
, (2,1,'Saved Form','2017-01-04 10:02:00.000')
, (2,1,'Saved Form','2017-01-04 10:04:00.000')
, (2,1,'Saved Form','2017-01-04 10:06:00.000')
, (2,1,'Exported Record','2017-01-04 10:10:00.000')
;
Query 1 (LEAD()):
SELECT s1.sessionID
, s1.userID
, s1.AuditType
, s1.[DateTime]
FROM (
SELECT foo.*
, LEAD(foo.AuditType) OVER ( PARTITION BY foo.userID, foo.sessionID ORDER BY foo.[DateTime] ) AS next_type
FROM foo
) s1
WHERE s1.next_type IS NULL OR s1.next_type <> s1.AuditType
ORDER BY s1.sessionID, s1.userID, s1.[DateTime]
| sessionID | userID | AuditType | DateTime |
|-----------|--------|-----------------|--------------------------|
| 1 | 1 | Viewed Record | 2017-01-03T11:16:33Z |
| 1 | 1 | Saved Form | 2017-01-04T09:54:59.837Z |
| 1 | 1 | Exported Record | 2017-01-04T09:55:59.837Z |
| 2 | 1 | Viewed Record | 2017-01-04T10:00:00Z |
| 2 | 1 | Saved Form | 2017-01-04T10:06:00Z |
| 2 | 1 | Exported Record | 2017-01-04T10:10:00Z |
| 2 | 2 | Viewed Record | 2017-01-04T09:52:00Z |
| 2 | 2 | Saved Form | 2017-01-04T09:52:48Z |
| 2 | 2 | Exported Record | 2017-01-04T09:53:00Z |
Query 2 (ROW_NUMBER()):
SELECT s1.*
FROM (
SELECT foo.*
, ROW_NUMBER() OVER ( PARTITION BY foo.userID, foo.sessionID, foo.AuditType ORDER BY foo.userID, foo.sessionID, foo.[DateTime] DESC ) AS rn
FROM foo
) s1
WHERE rn = 1
ORDER BY s1.sessionID, s1.userID, s1.[DateTime]
| ID | sessionID | userid | AuditType | DateTime | rn |
|----|-----------|--------|-----------------|--------------------------|----|
| 1 | 1 | 1 | Viewed Record | 2017-01-03T11:16:33Z | 1 |
| 13 | 1 | 1 | Saved Form | 2017-01-04T09:54:59.837Z | 1 |
| 14 | 1 | 1 | Exported Record | 2017-01-04T09:55:59.837Z | 1 |
| 15 | 2 | 1 | Viewed Record | 2017-01-04T10:00:00Z | 1 |
| 18 | 2 | 1 | Saved Form | 2017-01-04T10:06:00Z | 1 |
| 19 | 2 | 1 | Exported Record | 2017-01-04T10:10:00Z | 1 |
| 3 | 2 | 2 | Viewed Record | 2017-01-04T09:52:00Z | 1 |
| 9 | 2 | 2 | Saved Form | 2017-01-04T09:52:48Z | 1 |
| 12 | 2 | 2 | Exported Record | 2017-01-04T09:53:00Z | 1 |
They should both show:
1,1,'Viewed Record','2017-01-03 11:16:33.000'
1,1,'Saved Form','2017-01-04 09:54:59.837'
1,1,'Exported Record','2017-01-04 09:55:59.837'
2,1,'Viewed Record','2017-01-04 10:00:00.000'
2,1,'Saved Form','2017-01-04 10:06:00.000'
2,1,'Exported Record','2017-01-04 10:10:00.000'
2,2,'Viewed Record','2017-01-04 09:52:00.000'
2,2,'Saved Form','2017-01-04 09:52:48.000'
2,2,'Exported Record','2017-01-04 09:53:00.000'
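One thing worth checking against real data: Query 2 partitions by AuditType, so it keeps one row per type within each user and session rather than one row per consecutive run. The two queries agree on the sample rows because no type recurs after a different type within the same session. The sketch below adds one hypothetical row (not part of the original setup) and counts the rows each approach keeps for user 1, session 1:

```sql
-- Hypothetical extra row: user 1 saves the form again after exporting in session 1.
INSERT INTO foo ( sessionID, userID, AuditType, [DateTime] )
VALUES (1,1,'Saved Form','2017-01-04 09:57:00.000');

SELECT
    -- LEAD approach: one row per consecutive run of a type
    ( SELECT COUNT(*)
      FROM ( SELECT f.AuditType
                  , LEAD(f.AuditType) OVER ( PARTITION BY f.userID, f.sessionID ORDER BY f.[DateTime] ) AS next_type
             FROM foo f
             WHERE f.sessionID = 1 AND f.userID = 1 ) a
      WHERE a.next_type IS NULL OR a.next_type <> a.AuditType ) AS lead_rows
    -- ROW_NUMBER approach: one row per distinct type
  , ( SELECT COUNT(*)
      FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY f.AuditType ORDER BY f.[DateTime] DESC ) AS rn
             FROM foo f
             WHERE f.sessionID = 1 AND f.userID = 1 ) b
      WHERE b.rn = 1 ) AS row_number_rows;
-- With the extra row: lead_rows = 4, row_number_rows = 3
```

With the extra row, the LEAD version keeps both "Saved Form" runs (09:54:59.837 and 09:57:00.000), while the ROW_NUMBER version keeps only the later one. Which behavior is right depends on whether a repeated type within a session should count as a separate event.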