计算下四分位数和上四分位数的精确值
Calculate Exact Values Of Lower and Upper Quartiles
我在网上到处找,似乎没有符合我情况的答案。
我正在努力计算 SQL 服务器中的确切下四分位数和上四分位数。
我知道 SQL 服务器有一个有助于计算四分位数的内置函数,即 NTILE 函数。但这对我来说还不够。
给定以下 table 值(请注意 table 包含的产品和价格比下面 table 中的更多):
AveragePrice
ProductNumber
Year
45.7820
2
2015
46.0142
2
2016
59.0133
2
2017
60.1707
2
2018
62.6600
2
2019
我是运行以下查询:
SELECT
AveragePrice
,NTILE(4) OVER (
PARTITION BY ProductNumber ORDER BY AveragePrice
) AS Quartile
FROM products
结果如下:
AveragePrice
Quartile
45.7820
1
46.0142
1
59.0133
2
60.1707
3
62.6600
4
对于完整的上下文,整个查询如下所示:
SELECT ProductNumber
,MIN(AveragePrice) Minimum
,MAX(CASE
WHEN Quartile = 1
THEN AveragePrice
END) AS Quartile_1
,
MAX(CASE
WHEN Quartile = 3
THEN AveragePrice
END) AS Quartile_3
,MAX(AveragePrice) Maximum
,COUNT(Quartile) AS 'Number of items'
FROM (
SELECT ProductNumber
,AveragePrice
,NTILE(4) OVER (
PARTITION BY ProductNumber ORDER BY ProductNumber
) AS Quartile
FROM #temp_products
) Vals
GROUP BY ProductNumber
ORDER BY ProductNumber
但是当我手动计算四分位数时,第一个四分位数应该是:45.8981(在这种特殊情况下第一行和第二行的平均值)而不是 46.0142。
第三个四分位数应该是 61.41535(在这种特殊情况下第三个和第二个四分位数的平均值)而不是 60.1707。
所以说清楚。这是存储过程的一部分,其中计算多个价格组并将其聚合到包含平均价格的组中。我需要根据这些按产品编号分组的平均价格计算上四分位数和下四分位数。结果集应包含产品编号、下四分位数和上四分位数。
有人可以帮助我或指导我正确的方向吗?
在某些情况下,NTILE()
以一些奇怪的方式四舍五入。我宁愿使用带等级的整数除法进行分组。
此解决方案适用于任意数量的值,并在需要时使用经过深思熟虑的平均值。
LEAD
是捕捉下一行值的魔法窗口函数
select *
,[Q] = case when [rank] in ((N+3)/4 ,(N+1)/2, (3*N+1)/4) then
case [decimal]
when 0 then AveragePrice
when 0.25 then /*pondered avg*/(3*AveragePrice + LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 4
when 0.5 then /*simple avg*/( AveragePrice + LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 2
when 0.75 then /*pondered avg*/( AveragePrice +3*LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 4
end
end
from(
select *
,[rank] = ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)
,[N] = SUM(1)over()
,[group4] = ((ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)-1 )*4 / SUM(1)over())
,[decimal] = case /*rank*/ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)
when /*Q1*/ (SUM(1)over()+3)/4 then (SUM(1)over()+3)/4.0 - FLOOR((SUM(1)over()+3)/4.0)
when /*Q2*/ (SUM(1)over()+1)/2 then (SUM(1)over()+1)/2.0 - FLOOR((SUM(1)over()+1)/2.0)
when /*Q3*/(3*SUM(1)over()+1)/4 then (3*SUM(1)over()+1)/4.0 - FLOOR((3*SUM(1)over()+1)/4.0)
end
from
(values(45.7820,2,2015),(46.0142,2,2016),(59.0133,2,2017),(60.1707,2,2018),(62.6600,2,2019))a(AveragePrice,ProductNumber,Year)
)a
AveragePrice
ProductNumber
Year
rank
N
group4
decimal
Q
45.7820
2
2015
1
5
0
NULL
NULL
46.0142
2
2016
2
5
0
0.000000
46.014200
59.0133
2
2017
3
5
1
0.000000
59.013300
60.1707
2
2018
4
5
2
0.000000
60.170700
62.6600
2
2019
5
5
3
NULL
NULL
好的,受此启发 post 我设法构建了一个实际计算精确四分位数的查询:
-- ; since it is being used in a sp
;WITH quartile_data AS (
SELECT Price,
ProductNumber
FROM (values(29.4785,2,2015),(30.0000,2,2016),(33.4762,2,2017),(35.2917,2,2018),(35.8731,2,2018),(36.2475,2,2018),(37.9790,2,2018),(39.5846,2,2018),(67.4443,2,2018))sales(Price,ProductNumber)
)
--Aggregate into a single record for each group, using MAX to select the non-null
--detail value for each column
-- ISNULL check to include groups with three values as well
SELECT ProductNumber,
(Max(Q1NextVal) - MAX(Q1Val)) * Max(Q1Frac) + Max(Q1Val) as [Q1],
(Max(MidVal1) + Max(MidVal2)) / 2 [Median],
(ISNULL(Max(Q3NextVal),0) - MAX(Q3Val)) * Max(Q3Frac) + Max(Q3Val) as [Q3]
-- save the result into a temp table
INTO #my_temp_table
FROM (
--Expose the detail values for only the records at the index values
--generated by the summary subquery. All other values are left as NULL. som NULL.
SELECT detail.ProductNumber,
CASE WHEN RowNum = Q1Idx THEN Price ELSE NULL END Q1Val,
CASE WHEN RowNum = Q1Idx + 1 THEN Price ELSE NULL END Q1NextVal,
CASE WHEN RowNum = Q3Idx THEN Price ELSE NULL END Q3Val,
CASE WHEN RowNum = Q3Idx + 1 THEN Price ELSE NULL END Q3NextVal,
Q1Frac,
Q3Frac,
CASE WHEN RowNum = MidPt1 THEN Price ELSE NULL END MidVal1,
CASE WHEN RowNum = MidPt2 THEN Price ELSE NULL END MidVal2
FROM
--Calculate a row number sorted by measure for each group.
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ProductNumber ORDER BY Price) RowNum
FROM quartile_data) AS detail
INNER JOIN (
--Summarize to find index numbers and fractions we need to use to locate
--the values at the quartile points.
-- The modulus operator is used to sum the correct number if the number of rows in the group is even or uneven
SELECT ProductNumber,
FLOOR((COUNT(*) + IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) Q1Idx,
((COUNT(*) + IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) - FLOOR((COUNT(*) + IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) Q1Frac,
(COUNT(*) + 1) / 2 AS MidPt1,
(COUNT(*) + 2) / 2 AS Midpt2,
FLOOR((COUNT(*) * 3 + IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) Q3Idx,
((COUNT(*) * 3 + IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) - FLOOR((COUNT(*) * 3 + IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) Q3Frac
FROM quartile_data
GROUP BY ProductNumber
HAVING COUNT(*) > 1
) AS summary ON detail.ProductNumber = summary.ProductNumber
) AS step_two
GROUP BY ProductNumber
-- Include only groups with more than 2 rows
HAVING count(*) > 2
以下价格:
29.4785
30.0000
33.4762
35.2917
35.8731
36.2475
37.9790
39.5846
67.4443
给出正确的值:Q1 = 31.7381000000 和 Q3 = 38.7818000000
使用这个online tool
验证
我在网上到处找,似乎没有符合我情况的答案。
我正在努力计算 SQL 服务器中的确切下四分位数和上四分位数。 我知道 SQL 服务器有一个有助于计算四分位数的内置函数,即 NTILE 函数。但这对我来说还不够。
给定以下 table 值(请注意 table 包含的产品和价格比下面 table 中的更多):
AveragePrice | ProductNumber | Year |
---|---|---|
45.7820 | 2 | 2015 |
46.0142 | 2 | 2016 |
59.0133 | 2 | 2017 |
60.1707 | 2 | 2018 |
62.6600 | 2 | 2019 |
我是运行以下查询:
SELECT
AveragePrice
,NTILE(4) OVER (
PARTITION BY ProductNumber ORDER BY AveragePrice
) AS Quartile
FROM products
结果如下:
AveragePrice | Quartile |
---|---|
45.7820 | 1 |
46.0142 | 1 |
59.0133 | 2 |
60.1707 | 3 |
62.6600 | 4 |
对于完整的上下文,整个查询如下所示:
SELECT ProductNumber
,MIN(AveragePrice) Minimum
,MAX(CASE
WHEN Quartile = 1
THEN AveragePrice
END) AS Quartile_1
,
MAX(CASE
WHEN Quartile = 3
THEN AveragePrice
END) AS Quartile_3
,MAX(AveragePrice) Maximum
,COUNT(Quartile) AS 'Number of items'
FROM (
SELECT ProductNumber
,AveragePrice
,NTILE(4) OVER (
PARTITION BY ProductNumber ORDER BY ProductNumber
) AS Quartile
FROM #temp_products
) Vals
GROUP BY ProductNumber
ORDER BY ProductNumber
但是当我手动计算四分位数时,第一个四分位数应该是:45.8981(在这种特殊情况下第一行和第二行的平均值)而不是 46.0142。
第三个四分位数应该是 61.41535(在这种特殊情况下第三个和第二个四分位数的平均值)而不是 60.1707。
所以说清楚。这是存储过程的一部分,其中计算多个价格组并将其聚合到包含平均价格的组中。我需要根据这些按产品编号分组的平均价格计算上四分位数和下四分位数。结果集应包含产品编号、下四分位数和上四分位数。 有人可以帮助我或指导我正确的方向吗?
NTILE()
以一些奇怪的方式四舍五入。我宁愿使用带等级的整数除法进行分组。
此解决方案适用于任意数量的值,并在需要时使用经过深思熟虑的平均值。
LEAD
是捕捉下一行值的魔法窗口函数
select *
,[Q] = case when [rank] in ((N+3)/4 ,(N+1)/2, (3*N+1)/4) then
case [decimal]
when 0 then AveragePrice
when 0.25 then /*pondered avg*/(3*AveragePrice + LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 4
when 0.5 then /*simple avg*/( AveragePrice + LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 2
when 0.75 then /*pondered avg*/( AveragePrice +3*LEAD(AveragePrice,1,null)over(PARTITION BY ProductNumber ORDER BY AveragePrice)) / 4
end
end
from(
select *
,[rank] = ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)
,[N] = SUM(1)over()
,[group4] = ((ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)-1 )*4 / SUM(1)over())
,[decimal] = case /*rank*/ROW_NUMBER()over(PARTITION BY ProductNumber ORDER BY AveragePrice)
when /*Q1*/ (SUM(1)over()+3)/4 then (SUM(1)over()+3)/4.0 - FLOOR((SUM(1)over()+3)/4.0)
when /*Q2*/ (SUM(1)over()+1)/2 then (SUM(1)over()+1)/2.0 - FLOOR((SUM(1)over()+1)/2.0)
when /*Q3*/(3*SUM(1)over()+1)/4 then (3*SUM(1)over()+1)/4.0 - FLOOR((3*SUM(1)over()+1)/4.0)
end
from
(values(45.7820,2,2015),(46.0142,2,2016),(59.0133,2,2017),(60.1707,2,2018),(62.6600,2,2019))a(AveragePrice,ProductNumber,Year)
)a
AveragePrice | ProductNumber | Year | rank | N | group4 | decimal | Q |
---|---|---|---|---|---|---|---|
45.7820 | 2 | 2015 | 1 | 5 | 0 | NULL | NULL |
46.0142 | 2 | 2016 | 2 | 5 | 0 | 0.000000 | 46.014200 |
59.0133 | 2 | 2017 | 3 | 5 | 1 | 0.000000 | 59.013300 |
60.1707 | 2 | 2018 | 4 | 5 | 2 | 0.000000 | 60.170700 |
62.6600 | 2 | 2019 | 5 | 5 | 3 | NULL | NULL |
好的,受此启发 post 我设法构建了一个实际计算精确四分位数的查询:
-- ; since it is being used in a sp
;WITH quartile_data AS (
SELECT Price,
ProductNumber
FROM (values(29.4785,2,2015),(30.0000,2,2016),(33.4762,2,2017),(35.2917,2,2018),(35.8731,2,2018),(36.2475,2,2018),(37.9790,2,2018),(39.5846,2,2018),(67.4443,2,2018))sales(Price,ProductNumber)
)
--Aggregate into a single record for each group, using MAX to select the non-null
--detail value for each column
-- ISNULL check to include groups with three values as well
SELECT ProductNumber,
(Max(Q1NextVal) - MAX(Q1Val)) * Max(Q1Frac) + Max(Q1Val) as [Q1],
(Max(MidVal1) + Max(MidVal2)) / 2 [Median],
(ISNULL(Max(Q3NextVal),0) - MAX(Q3Val)) * Max(Q3Frac) + Max(Q3Val) as [Q3]
-- save the result into a temp table
INTO #my_temp_table
FROM (
--Expose the detail values for only the records at the index values
--generated by the summary subquery. All other values are left as NULL. som NULL.
SELECT detail.ProductNumber,
CASE WHEN RowNum = Q1Idx THEN Price ELSE NULL END Q1Val,
CASE WHEN RowNum = Q1Idx + 1 THEN Price ELSE NULL END Q1NextVal,
CASE WHEN RowNum = Q3Idx THEN Price ELSE NULL END Q3Val,
CASE WHEN RowNum = Q3Idx + 1 THEN Price ELSE NULL END Q3NextVal,
Q1Frac,
Q3Frac,
CASE WHEN RowNum = MidPt1 THEN Price ELSE NULL END MidVal1,
CASE WHEN RowNum = MidPt2 THEN Price ELSE NULL END MidVal2
FROM
--Calculate a row number sorted by measure for each group.
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ProductNumber ORDER BY Price) RowNum
FROM quartile_data) AS detail
INNER JOIN (
--Summarize to find index numbers and fractions we need to use to locate
--the values at the quartile points.
-- The modulus operator is used to sum the correct number if the number of rows in the group is even or uneven
SELECT ProductNumber,
FLOOR((COUNT(*) + IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) Q1Idx,
((COUNT(*) + IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) - FLOOR((COUNT(*) + IIF((COUNT(*) % 2 = 0), 2,1)) / 4.0) Q1Frac,
(COUNT(*) + 1) / 2 AS MidPt1,
(COUNT(*) + 2) / 2 AS Midpt2,
FLOOR((COUNT(*) * 3 + IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) Q3Idx,
((COUNT(*) * 3 + IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) - FLOOR((COUNT(*) * 3 + IIF((COUNT(*) % 2 = 0), 2,3)) / 4.0) Q3Frac
FROM quartile_data
GROUP BY ProductNumber
HAVING COUNT(*) > 1
) AS summary ON detail.ProductNumber = summary.ProductNumber
) AS step_two
GROUP BY ProductNumber
-- Include only groups with more than 2 rows
HAVING count(*) > 2
以下价格: 29.4785 30.0000 33.4762 35.2917 35.8731 36.2475 37.9790 39.5846 67.4443
给出正确的值:Q1 = 31.7381000000 和 Q3 = 38.7818000000
使用这个online tool
验证