Get DISTINCT COUNT in one pass in SQL Server
I have a table like the following:
Region Country Manufacturer Brand Period Spend
R1 C1 M1 B1 2016 5
R1 C1 M1 B1 2017 10
R1 C1 M1 B1 2017 20
R1 C1 M1 B2 2016 15
R1 C1 M1 B3 2017 20
R1 C2 M1 B1 2017 5
R1 C2 M2 B4 2017 25
R1 C2 M2 B5 2017 30
R2 C3 M1 B1 2017 35
R2 C3 M2 B4 2017 40
R2 C3 M2 B5 2017 45
...
I wrote the following query to aggregate it:
SELECT [Region]
,[Country]
,[Manufacturer]
,[Brand]
,Period
,SUM([Spend]) AS [Spend]
FROM myTable
GROUP BY [Region]
,[Country]
,[Manufacturer]
,[Brand]
,[Period]
ORDER BY 1,2,3,4
which produces the following:
Region Country Manufacturer Brand Period Spend
R1 C1 M1 B1 2016 5
R1 C1 M1 B1 2017 30 -- this row is an aggregate from raw table above
R1 C1 M1 B2 2016 15
R1 C1 M1 B3 2017 20
R1 C2 M1 B1 2017 4 -- aggregated result
R1 C2 M2 B4 2017 25
R1 C2 M2 B5 2017 30
R2 C3 M2 B4 2017 40
R2 C3 M2 B5 2017 45
I want to add another column to the table above that shows the DISTINCT COUNT of Brand grouped by Region, Country, Manufacturer and Period. So the final table would look like this:
Region Country Manufacturer Brand Period Spend UniqBrandCount
R1 C1 M1 B1 2016 5 2 -- two brands by R1, C1, M1 in 2016
R1 C1 M1 B1 2017 30 1
R1 C1 M1 B2 2016 15 2 -- same as first row's result
R1 C1 M1 B3 2017 20 1
R1 C2 M1 B1 2017 4 1
R1 C2 M2 B4 2017 25 2
R1 C2 M2 B5 2017 30 2
R2 C3 M2 B4 2017 40 2
R2 C3 M2 B5 2017 45 2
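I know how to get the final result in three steps.
Run this query (query #1):
SELECT [Region]
,[Country]
,[Manufacturer]
,[Period]
,COUNT(DISTINCT [Brand]) AS [BrandCount]
INTO Temp1
FROM myTable
GROUP BY [Region]
,[Country]
,[Manufacturer]
,[Period]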
Run this query (query #2):
SELECT [Region]
,[Country]
,[Manufacturer]
,[Brand]
,YEAR([Period]) AS Period
,SUM([Spend]) AS [Spend]
INTO Temp2
FROM myTable
GROUP BY [Region]
,[Country]
,[Manufacturer]
,[Brand]
,[Period]
Then LEFT JOIN Temp2 and Temp1 to bring in [BrandCount] from the latter, as follows:
SELECT a.*
,b.*
FROM Temp2 AS a
LEFT JOIN Temp1 AS b ON a.[Region] = b.[Region]
AND a.[Country] = b.[Country]
AND a.[Manufacturer] = b.[Manufacturer]
AND a.[Period] = b.[Period]
I'm pretty sure there is a more efficient way to do this, is there? Thanks in advance for your suggestions/answers!
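The tag on your question, window-functions, suggests you already have a pretty good idea.
For the DISTINCT COUNT of Brand grouped by Region, Country, Manufacturer and Period, you can write: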
Select Region
,Country
,Manufacturer
,Brand
,Period
,Spend
,DENSE_RANK() Over (Partition By Region, Country, Manufacturer, Period Order By Brand asc)
+ DENSE_RANK() Over (Partition By Region, Country, Manufacturer, Period Order By Brand desc)
-1 UniqBrandCount
From myTable T1
Order By 1,2,3,4
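Borrowing heavily from this question: https://dba.stackexchange.com/questions/89031/using-distinct-in-window-function-with-over
COUNT(DISTINCT ...) is not allowed with an OVER clause in SQL Server, hence the need for dense_rank. Rank the brands in ascending order and again in descending order, add the two ranks, and subtract 1 to get the distinct count.
Your SUM can also be rewritten using PARTITION BY logic. That way you can use a different grouping level for each aggregate: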
SELECT
[Region]
,[Country]
,[Manufacturer]
,[Brand]
,[Period]
,dense_rank() OVER
(PARTITION BY
[Region]
,[Country]
,[Manufacturer]
,[Period] Order by Brand)
+ dense_rank() OVER
(PARTITION BY
[Region]
,[Country]
,[Manufacturer]
,[Period] Order by Brand Desc)
- 1
AS [BrandCount]
,SUM([Spend]) OVER
(PARTITION BY
[Region]
,[Country]
,[Manufacturer]
,[Brand]
,[Period]) as [Spend]
from
myTable
ORDER BY 1,2,3,4
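You may then need to reduce the number of rows in the output, because this syntax returns the same number of rows as myTable, with the aggregated totals appearing on every row they apply to: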
R1 C1 M1 B1 2016 2 5
R1 C1 M1 B1 2017 2 30 --dup1
R1 C1 M1 B1 2017 2 30 --dup1
R1 C1 M1 B2 2016 2 15
R1 C1 M1 B3 2017 2 20
R1 C2 M1 B1 2017 1 5
R1 C2 M2 B4 2017 2 25
R1 C2 M2 B5 2017 2 30
R2 C3 M1 B1 2017 1 35
R2 C3 M2 B4 2017 2 40
R2 C3 M2 B5 2017 2 45
Selecting the distinct rows from this output gives you what you need.
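To see the effect of selecting distinct rows, here is a rough sketch that runs the windowed query with SELECT DISTINCT against the sample rows from the question. It uses Python's built-in sqlite3 module purely as a stand-in for SQL Server (an assumption of this sketch, and it needs SQLite 3.25 or later for window functions); the brackets are dropped, Period is added to the ORDER BY to make the row order deterministic, and the function name is made up for illustration.

```python
import sqlite3

# Sample rows copied from the question's table.
ROWS = [
    ("R1", "C1", "M1", "B1", 2016, 5),
    ("R1", "C1", "M1", "B1", 2017, 10),
    ("R1", "C1", "M1", "B1", 2017, 20),
    ("R1", "C1", "M1", "B2", 2016, 15),
    ("R1", "C1", "M1", "B3", 2017, 20),
    ("R1", "C2", "M1", "B1", 2017, 5),
    ("R1", "C2", "M2", "B4", 2017, 25),
    ("R1", "C2", "M2", "B5", 2017, 30),
    ("R2", "C3", "M1", "B1", 2017, 35),
    ("R2", "C3", "M2", "B4", 2017, 40),
    ("R2", "C3", "M2", "B5", 2017, 45),
]

# Same shape as the windowed query above, plus DISTINCT to collapse
# the rows that became identical after windowing.
QUERY = """
SELECT DISTINCT
       Region, Country, Manufacturer, Brand, Period,
       DENSE_RANK() OVER (PARTITION BY Region, Country, Manufacturer, Period
                          ORDER BY Brand)
     + DENSE_RANK() OVER (PARTITION BY Region, Country, Manufacturer, Period
                          ORDER BY Brand DESC)
     - 1 AS BrandCount,
       SUM(Spend) OVER (PARTITION BY Region, Country, Manufacturer, Brand, Period) AS Spend
FROM myTable
ORDER BY 1, 2, 3, 4, 5
"""


def run_distinct_window_query():
    """Load the sample rows into an in-memory SQLite table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE myTable (Region TEXT, Country TEXT, Manufacturer TEXT, "
        "Brand TEXT, Period INTEGER, Spend INTEGER)"
    )
    conn.executemany("INSERT INTO myTable VALUES (?, ?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


result = run_distinct_window_query()
for row in result:
    print(row)
```

The eleven input rows come back as ten, because the two R1/C1/M1/B1/2017 rows collapse into one.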
How the dense_rank trick works
Consider this data:
Col1 Col2
B 1
B 1
B 3
B 5
B 7
B 9
dense_rank() ranks each row by the number of distinct items that sort before it, plus 1. So:
1->1, 3->2, 5->3, 7->4, 9->5.
In reverse order (using desc) this yields the opposite pattern:
1->5, 3->4, 5->3, 7->2, 9->1.
Adding the two ranks together gives the same value for every item:
1+5 = 2+4 = 3+3 = 4+2 = 5+1 = 6
Putting it in words helps here:
(number of distinct items before + 1) + (number of distinct items after + 1)
= number of distinct OTHER items before AND after + 2
= Total number of distinct items + 1
So, to get the total number of distinct items, add the ascending and descending dense_rank together and subtract 1.
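The arithmetic can be checked outside the database. The following is a minimal sketch in plain Python, not SQL Server code: the dense_rank helper is a hypothetical stand-in that only mimics what DENSE_RANK() does within a single partition.

```python
def dense_rank(values, descending=False):
    """Map each value to (number of distinct values sorted before it) + 1."""
    ordered = sorted(set(values), reverse=descending)
    return {value: position + 1 for position, value in enumerate(ordered)}


col2 = [1, 1, 3, 5, 7, 9]  # Col2 from the example above

asc = dense_rank(col2)                    # 1->1, 3->2, 5->3, 7->4, 9->5
desc = dense_rank(col2, descending=True)  # 1->5, 3->4, 5->3, 7->2, 9->1

# Ascending rank + descending rank - 1 is the same on every row:
# the number of distinct items in the partition.
distinct_counts = [asc[v] + desc[v] - 1 for v in col2]
print(distinct_counts)  # [5, 5, 5, 5, 5, 5]
```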
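The double dense_rank idea means you need two sorts (assuming no index exists that provides the sort order). Assuming there are no NULL brands (which that idea assumes as well), you can use a single dense_rank and a windowed MAX, as follows: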
WITH T1
AS (SELECT *,
DENSE_RANK() OVER (PARTITION BY [Region], [Country], [Manufacturer], [Period] ORDER BY Brand) AS [dr]
FROM myTable),
T2
AS (SELECT *,
MAX([dr]) OVER (PARTITION BY [Region], [Country], [Manufacturer], [Period]) AS UniqBrandCount
FROM T1)
SELECT [Region],
[Country],
[Manufacturer],
[Brand],
Period,
SUM([Spend]) AS [Spend],
MAX(UniqBrandCount) AS UniqBrandCount
FROM T2
GROUP BY [Region],
[Country],
[Manufacturer],
[Brand],
[Period]
ORDER BY [Region],
[Country],
[Manufacturer],
[Period],
Brand
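The above involves some unavoidable spooling (it is not possible to do this in a 100% streaming fashion) but only a single sort.
Oddly, the final ORDER BY clause is needed to keep the number of sorts down to one (or zero if a suitable index exists).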
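The reason a single dense_rank is enough is that, within each partition, the largest ascending rank already equals the number of distinct brands. Here is a small plain-Python sketch of that reasoning using the sample rows from the question; the function name is made up for illustration, and it says nothing about how SQL Server actually executes the plan.

```python
from collections import defaultdict

# (Region, Country, Manufacturer, Brand, Period, Spend) from the question.
ROWS = [
    ("R1", "C1", "M1", "B1", 2016, 5),
    ("R1", "C1", "M1", "B1", 2017, 10),
    ("R1", "C1", "M1", "B1", 2017, 20),
    ("R1", "C1", "M1", "B2", 2016, 15),
    ("R1", "C1", "M1", "B3", 2017, 20),
    ("R1", "C2", "M1", "B1", 2017, 5),
    ("R1", "C2", "M2", "B4", 2017, 25),
    ("R1", "C2", "M2", "B5", 2017, 30),
    ("R2", "C3", "M1", "B1", 2017, 35),
    ("R2", "C3", "M2", "B4", 2017, 40),
    ("R2", "C3", "M2", "B5", 2017, 45),
]


def uniq_brand_count_by_partition(rows):
    """Return {(region, country, manufacturer, period): max ascending dense rank}."""
    partitions = defaultdict(list)
    for region, country, manufacturer, brand, period, _spend in rows:
        partitions[(region, country, manufacturer, period)].append(brand)

    counts = {}
    for key, brands in partitions.items():
        rank = 0
        previous = None
        for brand in sorted(brands):  # one sort per partition
            if rank == 0 or brand != previous:
                rank += 1  # dense rank only grows when the brand changes
                previous = brand
        counts[key] = rank  # the last rank reached is also the maximum
    return counts


counts = uniq_brand_count_by_partition(ROWS)
print(counts[("R1", "C1", "M1", 2017)])  # 2 (B1 and B3)
```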