Efficient way to cluster a timeline OR reconstruct a batch number
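I am working with a large data set (150k/day) from a test database. Each row contains data about a specific test of a product. Each tester inserts the results of his tests.
I want to do some measurements, such as the pass-fail rate, over a shift for each product and tester. The problem is that no batch numbers are assigned, so I can't select this easily.
Consider the following subselect of the whole table: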
id tBegin orderId
------------------------------------
1 2018-10-20 00:00:05 1
2 2018-10-20 00:05:15 1
3 2018-10-20 01:00:05 1
10 2018-10-20 10:03:05 3
12 2018-10-20 11:04:05 8
20 2018-10-20 14:15:05 3
37 2018-10-20 18:12:05 1
My goal is to cluster the data into the following:
id tBegin orderId pCount
--------------------------------------------
1 2018-10-20 00:00:05 1 3
10 2018-10-20 10:03:05 3 1
12 2018-10-20 11:04:05 8 1
20 2018-10-20 14:15:05 3 1
37 2018-10-20 18:12:05 1 1
A simple GROUP BY orderID won't do the trick, so I came up with the following:
SELECT
    MIN(c.id) AS id,
    MIN(c.tBegin) AS tBegin,
    c.orderId,
    COUNT(*) AS pCount
FROM (
    SELECT t2.id, t2.tBegin, t2.orderId,
        ( SELECT TOP 1 t.id
          FROM history t
          WHERE t.tBegin > t2.tBegin
            AND t.orderID <> t2.orderID
            AND <restrict date here further>
          ORDER BY t.tBegin
        ) AS nextId
    FROM history t2
) AS c
WHERE <restrict date here>
GROUP BY c.orderID, c.nextId
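I left out the WHERE clauses that select the correct dates and tester.
This works, but it seems very inefficient. I have worked with small databases before, but I am new to SQL Server 2017.
I would appreciate your help very much!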
You can use window functions for this:
DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
INSERT INTO @t VALUES
(1 , '2018-10-20 00:00:05', 1),
(2 , '2018-10-20 00:05:15', 1),
(3 , '2018-10-20 01:00:05', 1),
(10, '2018-10-20 10:03:05', 3),
(12, '2018-10-20 11:04:05', 8),
(20, '2018-10-20 14:15:05', 3),
(37, '2018-10-20 18:12:05', 1);
WITH cte1 AS (
    SELECT *,
           CASE WHEN orderId = LAG(orderId) OVER (ORDER BY tBegin) THEN 0 ELSE 1 END AS chg
    FROM @t
), cte2 AS (
    SELECT *, SUM(chg) OVER (ORDER BY tBegin) AS grp
    FROM cte1
), cte3 AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY grp ORDER BY tBegin) AS rn
    FROM cte2
)
SELECT *
FROM cte3
WHERE rn = 1;
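- The first CTE assigns a "change flag" of 1 to every row whose orderId differs from the previous row's (ordered by tBegin), and 0 otherwise
- The second CTE uses a running sum to convert the 1s and 0s into a number that can be used to group the rows
- Finally, you number the rows within each group and select the first row of each group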
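The query above returns the first row of each group but not the pCount column from the desired output. A minimal sketch of one way to get it, reusing the same change-flag and running-sum steps and then aggregating per group instead of numbering the rows, might look like this. The CTE names flagged and grouped are made up for illustration, @t stands in for the real history table, and MIN(id) is only the id of the group's first row if id increases together with tBegin, as it does in the sample data.

```sql
-- Sample rows from the question; @t is a stand-in for the real history table.
DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
INSERT INTO @t VALUES
(1 , '2018-10-20 00:00:05', 1),
(2 , '2018-10-20 00:05:15', 1),
(3 , '2018-10-20 01:00:05', 1),
(10, '2018-10-20 10:03:05', 3),
(12, '2018-10-20 11:04:05', 8),
(20, '2018-10-20 14:15:05', 3),
(37, '2018-10-20 18:12:05', 1);

WITH flagged AS (
    -- chg = 1 on the first row and on every row whose orderId differs
    -- from the previous row's orderId (in tBegin order), otherwise 0.
    SELECT id, tBegin, orderId,
           CASE WHEN orderId = LAG(orderId) OVER (ORDER BY tBegin) THEN 0 ELSE 1 END AS chg
    FROM @t
), grouped AS (
    -- Running sum of the flags: rows of the same consecutive run share one grp value.
    SELECT id, tBegin, orderId,
           SUM(chg) OVER (ORDER BY tBegin ROWS UNBOUNDED PRECEDING) AS grp
    FROM flagged
)
-- One output row per run, with the run's first id/tBegin and its row count.
SELECT MIN(id) AS id, MIN(tBegin) AS tBegin, orderId, COUNT(*) AS pCount
FROM grouped
GROUP BY grp, orderId
ORDER BY MIN(tBegin);
```

For the sample rows this yields five groups, with pCount = 3 for the first run of orderId 1 and pCount = 1 for each of the others, which matches the target table in the question.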
You can use a cumulative approach, based on the difference of two row numbers:
select min(id) as id, min(tBegin) as tBegin, orderid, count(*) as pCount
from (select h.*,
             row_number() over (order by id) as seq1,
             row_number() over (partition by orderid order by id) as seq2
      from history h
     ) h
group by orderid, (seq1 - seq2)
order by id;
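To see why grouping by (seq1 - seq2) works, it can help to look at the intermediate values. The sketch below is only an illustration: it loads the sample rows from the question into a table variable @t (a stand-in for history) and lists both row numbers and their difference, here given the made-up alias diff. It assumes, like the query above, that ordering by id matches the time order.

```sql
-- Sample rows from the question; @t is a stand-in for the real history table.
DECLARE @t TABLE (id INT, tBegin DATETIME, orderId INT);
INSERT INTO @t VALUES
(1 , '2018-10-20 00:00:05', 1),
(2 , '2018-10-20 00:05:15', 1),
(3 , '2018-10-20 01:00:05', 1),
(10, '2018-10-20 10:03:05', 3),
(12, '2018-10-20 11:04:05', 8),
(20, '2018-10-20 14:15:05', 3),
(37, '2018-10-20 18:12:05', 1);

SELECT id, orderId,
       ROW_NUMBER() OVER (ORDER BY id) AS seq1,                       -- position in the whole timeline
       ROW_NUMBER() OVER (PARTITION BY orderId ORDER BY id) AS seq2,  -- position within the same orderId
       ROW_NUMBER() OVER (ORDER BY id)
     - ROW_NUMBER() OVER (PARTITION BY orderId ORDER BY id) AS diff   -- constant within a consecutive run
FROM @t
ORDER BY id;

-- id  orderId  seq1  seq2  diff
--  1     1       1     1     0
--  2     1       2     2     0
--  3     1       3     3     0
-- 10     3       4     1     3
-- 12     8       5     1     4
-- 20     3       6     2     4
-- 37     1       7     4     3
```

Within a consecutive run of the same orderId both counters advance together, so diff stays constant; as soon as other orders come in between, seq1 moves ahead and diff changes. The pair (orderId, diff) therefore identifies each run, which is what the GROUP BY in the query above relies on.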