SQL - Max() value from the group is not working
Image 1: sample data
Image 2: incorrect output
Image 3: desired output
Query: I am trying to get the max value of a column (Median_Percentage) by Class_Name and customer (image 1, sample data).
Problem: the query shows every customer instead of only the customer with the max median value (image 2, incorrect output). It computes Max() correctly, but it returns the value against every customer rather than only the customer that holds that max within the Class_Name.
All I need is, per Class_Name, to show the customer that has Max(Median_Percentage) (image 3, desired output).
Select distinct
    C.Class_Name,
    C.Customer,
    C.Max_Median_Percentage
FROM (
    SELECT
        B.Class_Name,
        case when B.Median_Percentage = Max(B.Median_Percentage) OVER (PARTITION BY B.Class_Name ORDER BY B.Median_Percentage desc)
             then B.Customer
        end as Customer,
        Max(B.Median_Percentage) OVER (PARTITION BY B.Class_Name ORDER BY B.Median_Percentage desc) as Max_Median_Percentage
    FROM (
        SELECT
            A.Class_Name,
            A.Customer,
            A.Date_Time,
            A.Median_Percentage
        From table1 as A
    ) as B
) as C
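One common fix (a sketch, not necessarily the only approach) is to rank the rows within each Class_Name by Median_Percentage descending and keep only rank 1, instead of NULL-ing out the non-max customers and relying on distinct. The demo below runs the idea through Python's sqlite3 module (window functions need SQLite 3.25+); table and column names follow the question.

```python
import sqlite3

# Rank rows within each Class_Name by Median_Percentage (descending)
# and keep only the top-ranked customer(s).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE table1
                (Class_Name TEXT, Customer TEXT,
                 Date_Time TEXT, Median_Percentage INTEGER)""")
conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", [
    ("ClassA", "A", "6/13/20", 64550),
    ("ClassA", "B", "6/6/20", 40200),
    ("ClassB", "F", "6/20/20", 26800),
    ("ClassB", "G", "6/20/20", 18100),
])

rows = conn.execute("""
    SELECT Class_Name, Customer, Median_Percentage
    FROM (SELECT A.*,
                 RANK() OVER (PARTITION BY Class_Name
                              ORDER BY Median_Percentage DESC) AS rnk
          FROM table1 AS A)
    WHERE rnk = 1
    ORDER BY Class_Name
""").fetchall()
# one row per Class_Name: the customer holding the max Median_Percentage
```

RANK() keeps ties (two customers sharing the max both survive); use ROW_NUMBER() instead if exactly one row per class is wanted.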
If your database does not directly support a "median" function, you can use percentile_cont():
select t.*,
boot_time / percentile_cont(0.5) within group (order by boot_time) over (partition by classid)
from t;
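For reference, percentile_cont(0.5) within each partition is exactly the continuous (interpolated) median. A plain-Python sketch of the same per-group computation, using hypothetical classid/boot_time data:

```python
from collections import defaultdict
from statistics import median

# Hypothetical per-class boot times (classid/boot_time mirror the answer's names).
data = [("c1", 10), ("c1", 20), ("c1", 40), ("c2", 8), ("c2", 12)]

groups = defaultdict(list)
for classid, boot_time in data:
    groups[classid].append(boot_time)

# statistics.median interpolates for even counts, like percentile_cont(0.5).
medians = {classid: median(times) for classid, times in groups.items()}
ratios = [(classid, bt, bt / medians[classid]) for classid, bt in data]
```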
If your database has neither percentile_cont() nor percentile_disc(), you can use a simple ntile() to get a very close result:
select t.*,
boot_time / max(case when tile = 1 then boot_time end) over (partition by classid)
from (select t.*,
ntile(2) over (partition by classid order by boot_time) as tile
from t
) t
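A runnable check of the ntile(2) approximation, again through Python's sqlite3 (SQLite 3.25+); the classid/boot_time data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (classid TEXT, boot_time REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("c1", 10.0), ("c1", 20.0), ("c1", 40.0)])

rows = conn.execute("""
    select classid, boot_time,
           boot_time / max(case when tile = 1 then boot_time end)
               over (partition by classid) as ratio
    from (select t.*,
                 ntile(2) over (partition by classid order by boot_time) as tile
          from t
         ) t
    order by boot_time
""").fetchall()
# with 3 rows, tile 1 holds {10, 20}; its max (20) is the true median
```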
This is exact when the number of rows per classid is odd. For an even count it is off by one row. You can handle that too, but it is more complicated:
select t.*,
(boot_time /
       ((max(case when tile_asc = 1 then boot_time end) over (partition by classid) +
         min(case when tile_desc = 1 then boot_time end) over (partition by classid)
) / 2
)
)
from (select t.*,
ntile(2) over (partition by classid order by boot_time) as tile_asc,
ntile(2) over (partition by classid order by boot_time desc) as tile_desc
from t
) t
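A sketch verifying the even-count variant, assuming the intended formula averages the lower half's max with the upper half's min (sqlite3 again, illustrative data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (classid TEXT, boot_time REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("c1", 10.0), ("c1", 20.0), ("c1", 30.0), ("c1", 40.0)])

rows = conn.execute("""
    select classid, boot_time,
           boot_time /
             ((max(case when tile_asc = 1 then boot_time end)
                   over (partition by classid) +
               min(case when tile_desc = 1 then boot_time end)
                   over (partition by classid)
              ) / 2) as ratio
    from (select t.*,
                 ntile(2) over (partition by classid order by boot_time) as tile_asc,
                 ntile(2) over (partition by classid order by boot_time desc) as tile_desc
          from t
         ) t
    order by boot_time
""").fetchall()
# median estimate = (20 + 30) / 2 = 25, so ratios are 0.4, 0.8, 1.2, 1.6
```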
Perhaps this is helpful -
Load the test data provided:
val df = spark.sql(
"""
|select Class_Name, Customer, Date_Time, Median_Percentage
|from values
| ('ClassA', 'A', '6/13/20', 64550),
| ('ClassA', 'B', '6/6/20', 40200),
| ('ClassB', 'F', '6/20/20', 26800),
| ('ClassB', 'G', '6/20/20', 18100)
| T(Class_Name, Customer, Date_Time, Median_Percentage)
""".stripMargin)
df.show(false)
df.printSchema()
/**
* +----------+--------+---------+-----------------+
* |Class_Name|Customer|Date_Time|Median_Percentage|
* +----------+--------+---------+-----------------+
* |ClassA |A |6/13/20 |64550 |
* |ClassA |B |6/6/20 |40200 |
* |ClassB |F |6/20/20 |26800 |
* |ClassB |G |6/20/20 |18100 |
* +----------+--------+---------+-----------------+
*
* root
* |-- Class_Name: string (nullable = false)
* |-- Customer: string (nullable = false)
* |-- Date_Time: string (nullable = false)
* |-- Median_Percentage: integer (nullable = false)
*/
Find the row with the max Median_Percentage per Class_Name:
import org.apache.spark.sql.functions.{max, struct}
import spark.implicits._  // for the $"col" syntax

df.groupBy("Class_Name")
  .agg(max(struct($"Median_Percentage", $"Date_Time", $"Customer")).as("struct"))
  .selectExpr("Class_Name", "struct.Customer", "struct.Date_Time", "struct.Median_Percentage")
  .show(false)
/**
* +----------+--------+---------+-----------------+
* |Class_Name|Customer|Date_Time|Median_Percentage|
* +----------+--------+---------+-----------------+
* |ClassA |A |6/13/20 |64550 |
* |ClassB |F |6/20/20 |26800 |
* +----------+--------+---------+-----------------+
*/
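The max(struct(...)) trick works because Spark compares struct fields left to right, so the maximum Median_Percentage drags its whole row along. The same idea in plain Python (a sketch of the technique, not the Spark API), using the sample rows:

```python
# Group rows by Class_Name and keep the lexicographically largest
# (Median_Percentage, Date_Time, Customer) tuple, mirroring max(struct(...)).
rows = [
    ("ClassA", "A", "6/13/20", 64550),
    ("ClassA", "B", "6/6/20", 40200),
    ("ClassB", "F", "6/20/20", 26800),
    ("ClassB", "G", "6/20/20", 18100),
]

best = {}
for class_name, customer, date_time, pct in rows:
    key = (pct, date_time, customer)  # field order decides tie-breaking
    if class_name not in best or key > best[class_name]:
        best[class_name] = key

winners = [(cls, cust, dt, pct)
           for cls, (pct, dt, cust) in sorted(best.items())]
# one winner per class: the row carrying the max Median_Percentage
```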