Calculate cumulative percentiles using SQL for a group/partition
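I want to calculate cumulative percentiles for a given partition/group in SQL. For example, the input data looks like this:
CustID  ProductID  quantity_purchased
1       111        2
2       111        3
3       111        2
4       111        5
1       222        2
2       222        6
4       222        7
6       222        2
I want to get the cumulative percentiles for each product ID group. The output should be:
ProductID  min  25%  50%  75%   max
111        2    2    2.5  3.5   5
222        2    2    2.5  5.25  7
So, basically, for product ID 111 I only need the percentiles of quantity_purchased for product ID 111. But as I move down the column the percentiles should be cumulative: for product ID 222, the percentiles are computed over the quantity_purchased values of both product ID 111 and product ID 222 (2, 3, 2, 5, 2, 6, 7, 2). Similarly, if the data contained a product ID 333, then for product ID 333 I would compute the percentiles over all quantity_purchased values associated with products 111, 222, and 333, and store the result in the product 333 row. How can I achieve this using SQL?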
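This uses PERCENTILE_CONT instead of PERCENTILE_DISC. The main difference is that PERCENTILE_CONT returns a value based on a continuous distribution, using linear interpolation when the requested percentile falls between two values. Depending on your use case, this may give you more accurate data points. :-)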
select
    ProductID,
    min(Quantity_Purchased::float) min,
    PERCENTILE_CONT(.25) WITHIN GROUP (ORDER BY Quantity_Purchased) as "25%",
    PERCENTILE_CONT(.50) WITHIN GROUP (ORDER BY Quantity_Purchased) as "50%",
    PERCENTILE_CONT(.75) WITHIN GROUP (ORDER BY Quantity_Purchased) as "75%",
    max(Quantity_Purchased) max
from
    cte
group by
    1
Copy|Paste|Run in Snowflake:
with cte as (
    select 1 CustID, 111 ProductID, 2 Quantity_Purchased
    union
    select 2 CustID, 111 ProductID, 3 Quantity_Purchased
    union
    select 3 CustID, 111 ProductID, 2 Quantity_Purchased
    union
    select 4 CustID, 111 ProductID, 5 Quantity_Purchased
    union
    select 1 CustID, 222 ProductID, 2 Quantity_Purchased
    union
    select 2 CustID, 222 ProductID, 6 Quantity_Purchased
    union
    select 4 CustID, 222 ProductID, 7 Quantity_Purchased
    union
    select 6 CustID, 222 ProductID, 2 Quantity_Purchased
)
select
    ProductID,
    min(Quantity_Purchased::float) min,
    PERCENTILE_CONT(.25) WITHIN GROUP (ORDER BY Quantity_Purchased) as "25%",
    PERCENTILE_CONT(.50) WITHIN GROUP (ORDER BY Quantity_Purchased) as "50%",
    PERCENTILE_CONT(.75) WITHIN GROUP (ORDER BY Quantity_Purchased) as "75%",
    max(Quantity_Purchased) max
from
    cte
group by
    1
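To make the PERCENTILE_CONT versus PERCENTILE_DISC difference concrete, here is a small sketch that runs both on the four product 111 quantities (2, 3, 2, 5). The CTE name `qtys` and the column aliases are made up for illustration, and the syntax assumes Snowflake or another engine that supports both functions as ordered-set aggregates.

```sql
-- Hypothetical inline data: the product 111 quantities from the question.
with qtys as (
    select 2 as qty union all
    select 3 union all
    select 2 union all
    select 5
)
select
    -- interpolates between the two middle values (2 and 3)
    percentile_cont(0.5) within group (order by qty) as median_cont,
    -- picks a value that actually occurs in the data
    percentile_disc(0.5) within group (order by qty) as median_disc
from qtys;
```

Sorted, the values are 2, 2, 3, 5. PERCENTILE_CONT lands halfway between the second and third values and returns 2.5, while PERCENTILE_DISC returns the first value whose cumulative distribution reaches 0.5, which is 2 here.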
This is rather curious, but I think you need to expand out the data for each product ID:
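select p.product_id,
       min(t2.quantity_purchased) as min_qty,
       max(t2.quantity_purchased) as max_qty,
       percentile_cont(0.25) within group (order by t2.quantity_purchased) as p25,
       percentile_cont(0.50) within group (order by t2.quantity_purchased) as p50,
       percentile_cont(0.75) within group (order by t2.quantity_purchased) as p75
-- join from one row per product so each t2 row is counted once per group;
-- duplicated rows can shift the interpolated percentiles
from (select distinct product_id from t) p join
     t t2
     on t2.product_id <= p.product_id
group by p.product_id;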
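As a way to try the self-join idea end to end, here is a self-contained sketch on the question's sample rows. The CTE names `t` and `products` and the output aliases are assumptions for illustration. It joins from the distinct product IDs rather than from `t` itself, so that each quantity is counted once per cumulative group, and it assumes an engine where percentile_cont is available as an ordered-set aggregate (for example Snowflake or PostgreSQL).

```sql
-- Sample rows from the question; CTE and alias names are illustrative.
with t as (
    select 1 as cust_id, 111 as product_id, 2 as quantity_purchased union all
    select 2, 111, 3 union all
    select 3, 111, 2 union all
    select 4, 111, 5 union all
    select 1, 222, 2 union all
    select 2, 222, 6 union all
    select 4, 222, 7 union all
    select 6, 222, 2
),
products as (
    -- one row per product, so joined rows are not duplicated
    select distinct product_id from t
)
select p.product_id,
       min(t2.quantity_purchased) as min_qty,
       percentile_cont(0.25) within group (order by t2.quantity_purchased) as p25,
       percentile_cont(0.50) within group (order by t2.quantity_purchased) as p50,
       percentile_cont(0.75) within group (order by t2.quantity_purchased) as p75,
       max(t2.quantity_purchased) as max_qty
from products p
join t t2
  on t2.product_id <= p.product_id   -- every product up to and including this one
group by p.product_id
order by p.product_id;
```

Working the numbers by hand, this should reproduce the output requested in the question, although numeric formatting varies by engine. For product 222 the group holds all eight values, sorted as 2, 2, 2, 2, 3, 5, 6, 7. The 75th percentile sits at zero-based position 0.75 × 7 = 5.25, a quarter of the way from 5 to 6, which gives 5.25. Note that the join grows quadratically with the number of products, so it may be slow on large tables.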