SQL Server row_number() over partition by, but ignore repeating categorical values
I'm trying to track distinct paths through multiple categories. A simplified view of my table looks like this:
Table: customer_category
CustomerID | Category | Date
11111 | A | 2016-01-01
11111 | B | 2016-02-01
11111 | C | 2016-03-01
22222 | A | 2016-01-01
22222 | A | 2016-02-01
22222 | A | 2016-03-01
22222 | C | 2016-04-01
33333 | A | 2016-01-01
33333 | B | 2016-02-01
33333 | C | 2016-03-01
33333 | C | 2016-04-01
I can find the absolute path with this query:
with cat_order as (
    select CustomerID
          ,Category
          ,row_number() over (partition by CustomerID order by Date) as rnk
    from customer_category
),pivot as (
    select CustomerID
          ,max(case when rnk = 1 then Category else null end) as category_1
          ,max(case when rnk = 2 then Category else null end) as category_2
          ,max(case when rnk = 3 then Category else null end) as category_3
          ,max(case when rnk = 4 then Category else null end) as category_4
    from cat_order
    group by CustomerID
)
select category_1, category_2, category_3, category_4, count(*) as count
from pivot
group by category_1, category_2, category_3, category_4
;
This gives me the following:
category_1 | category_2 | category_3 | category_4 | count
A | B | C | | 1
A | A | A | C | 1
A | B | C | C | 1
What I want, though, is to ignore repeated categories, so that I would see:
category_1 | category_2 | category_3 | category_4 | count
A | B | C | | 2
A | C | | | 1
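As I see it, I need to:
- omit any record where Category = lag(category)
- rank over the partition...
- pivot using case statements
- aggregate the results
That feels overly complicated. Is there a simpler way?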
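As far as I know, there is no simpler way (given your data and desired output). To get the result you want, you essentially have to perform the four steps you outlined, or some variation of them. You can, however, "simplify" it by writing it in a way that doesn't need CTEs. For example: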
SELECT category_1 = P.[1],
       category_2 = P.[2],
       category_3 = P.[3],
       category_4 = P.[4],
       [Count] = COUNT(*)
FROM
(
    SELECT CustomerID,
           Category,
           rnk = SUM(checkprev) OVER (PARTITION BY CustomerID ORDER BY [Date])
    FROM
    (
        SELECT *, checkprev = CASE WHEN LAG(Category) OVER (PARTITION BY CustomerID ORDER BY [Date]) = Category THEN 0 ELSE 1 END
        FROM customer_category
    ) T
) AS T
PIVOT
(
    MAX(Category) FOR rnk IN ([1], [2], [3], [4])
) AS P
GROUP BY P.[1], P.[2], P.[3], P.[4];
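For comparison, here is a rough sketch of the four steps from the question written out literally as CTEs, keeping the CASE-based pivot from the original query. The CTE names (flagged, ranked, paths) and the prev_category column are my own and purely illustrative. It assumes Category is never NULL, that a customer has at most one row per Date, and that LAG is available (SQL Server 2012 or later).

```sql
-- Sketch only: flagged, ranked, paths and prev_category are made-up names.
WITH flagged AS (
    -- Fetch each row's previous category within the same customer
    SELECT CustomerID,
           Category,
           [Date],
           prev_category = LAG(Category) OVER (PARTITION BY CustomerID ORDER BY [Date])
    FROM customer_category
),
ranked AS (
    -- Steps 1 and 2: drop rows that repeat the previous category,
    -- then number the rows that remain
    SELECT CustomerID,
           Category,
           rnk = ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY [Date])
    FROM flagged
    WHERE prev_category IS NULL          -- first row for the customer
       OR prev_category <> Category      -- category actually changed
),
paths AS (
    -- Step 3: pivot with CASE expressions, one row per customer
    SELECT CustomerID,
           category_1 = MAX(CASE WHEN rnk = 1 THEN Category END),
           category_2 = MAX(CASE WHEN rnk = 2 THEN Category END),
           category_3 = MAX(CASE WHEN rnk = 3 THEN Category END),
           category_4 = MAX(CASE WHEN rnk = 4 THEN Category END)
    FROM ranked
    GROUP BY CustomerID
)
-- Step 4: count customers per distinct path
SELECT category_1, category_2, category_3, category_4, [Count] = COUNT(*)
FROM paths
GROUP BY category_1, category_2, category_3, category_4;
```

On the sample data this should collapse customers 11111 and 33333 into the same A, B, C path and leave 22222 as A, C. The difference from the query above is that repeats are filtered out before ranking, instead of being given the same rank through a running sum and then collapsed by MAX during the pivot.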