Refactor SQL Query: CTE vs. Subquery

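I am building a dataset from an S3 bucket and trying to improve the performance of my query. Both approaches I currently use work, but I would like to see a better query and learn how to improve my SQL skills. Apologies for not providing a sample dataset; I have not yet figured out a practical way to supply mock data when extracting from .json files in S3.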

Query #1

WITH block_1 AS (
    SELECT
        VALUE:COL1 AS COL1,
        VALUE:COL2 AS COL2,
        VALUE:COL3 AS COL3,
        VALUE:COL4 AS COL4
    FROM '@S3_BUCKET/',
        LATERAL FLATTEN(input => :value)
),
block_2 AS (
    -- largest COL4 per COL1
    SELECT
        VALUE:COL1 AS COL1,
        MAX(VALUE:COL4) AS MaxCOL4
    FROM '@S3_BUCKET/',
        LATERAL FLATTEN(input => :value)
    GROUP BY COL1
)
-- keep only the rows that hold the per-COL1 maximum of COL4
SELECT a.COL1, a.COL2, a.COL3, a.COL4
FROM block_1 AS a
JOIN block_2 AS b
    ON a.COL1 = b.COL1 AND a.COL4 = b.MaxCOL4;

Query #2, which I think is an improvement, particularly because you do not need to list the columns you want in the final SELECT statement (as I did above):

SELECT a.*
FROM (
    SELECT
        VALUE:COL1 AS COL1,
        VALUE:COL2 AS COL2,
        VALUE:COL3 AS COL3,
        VALUE:COL4 AS COL4
    FROM '@S3_BUCKET/',
        LATERAL FLATTEN(input => :value)
) a
JOIN (
    SELECT COL1, MAX(COL4) AS COL4
    FROM (
        SELECT
            VALUE:COL1 AS COL1,
            VALUE:COL2 AS COL2,
            VALUE:COL3 AS COL3,
            VALUE:COL4 AS COL4
        FROM '@S3_BUCKET/',
            LATERAL FLATTEN(input => :value)
    )
    GROUP BY COL1
) b
    ON a.COL1 = b.COL1 AND a.COL4 = b.COL4;

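These two are my current attempts. Is there a way to make this query better? Another route I considered is using WHERE ... IN with a list of COL1 values, but I would still essentially have to hit S3 twice, as in the queries above.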

You should be able to simplify this query with window functions, specifically RANK():

WITH block_1 AS (
    SELECT 
    VALUE:COL1 AS COL1, 
    VALUE:COL2 AS COL2, 
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4,
    RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
    FROM '@S3_BUCKET/', 
     lateral flatten( input => :value)
)
SELECT COL1, COL2, COL3, COL4
FROM block_1
WHERE rk = 1

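This can be simplified further thanks to Snowflake's QUALIFY clause, which lets you filter on the window function's alias; it is effectively a HAVING clause for window functions: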

SELECT 
    VALUE:COL1 AS COL1, 
    VALUE:COL2 AS COL2, 
    VALUE:COL3 AS COL3,
    VALUE:COL4 AS COL4,
    RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
FROM '@S3_BUCKET/', 
     lateral flatten( input => :value)
QUALIFY rk = 1
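
Since the question has no sample data, the claim that the RANK() filter returns the same rows as the join on MAX(COL4) can be checked on mock rows. The sketch below rests on several assumptions: it uses Python's built-in sqlite3 module as a stand-in for Snowflake, a hypothetical table `src` with made-up rows in place of the flattened S3 data, and an outer `WHERE rk = 1` because SQLite has no QUALIFY. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical mock rows standing in for the flattened JSON (col1..col4).
ROWS = [
    ("a", "x1", "y1", 10),
    ("a", "x2", "y2", 30),
    ("b", "x3", "y3", 5),
    ("b", "x4", "y4", 7),
    ("c", "x5", "y5", 1),
]

# Shape of Query #1 / Query #2: join the detail rows to the per-group maximum.
JOIN_ON_MAX = """
SELECT s.col1, s.col2, s.col3, s.col4
FROM src AS s
JOIN (SELECT col1, MAX(col4) AS max_col4 FROM src GROUP BY col1) AS m
  ON s.col1 = m.col1 AND s.col4 = m.max_col4
ORDER BY s.col1, s.col2
"""

# Shape of the RANK() answer: rank rows inside each col1 group, keep rank 1.
RANK_FILTER = """
SELECT col1, col2, col3, col4
FROM (
    SELECT src.*,
           RANK() OVER (PARTITION BY col1 ORDER BY col4 DESC) AS rk
    FROM src
)
WHERE rk = 1
ORDER BY col1, col2
"""


def run(query):
    """Load the mock rows into an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE src (col1 TEXT, col2 TEXT, col3 TEXT, col4 INTEGER)"
    )
    conn.executemany("INSERT INTO src VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


join_rows = run(JOIN_ON_MAX)
rank_rows = run(RANK_FILTER)
print(join_rows == rank_rows)  # True
```

Both forms return the same rows on this data, and the window version names the source only once in the SQL text. Whether that becomes a single read of the stage in Snowflake is worth confirming in the query profile rather than assuming.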

@Nick. Using QUALIFY, this acts as a WHERE filter set to = 1. I also replaced RANK with ROW_NUMBER. Does that make sense?
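
Swapping RANK for ROW_NUMBER changes the result when several rows in a group tie on the maximum COL4. A minimal sketch of the difference, again assuming Python's sqlite3 as a stand-in for Snowflake and a hypothetical table `src` with made-up rows; the `col2` tiebreaker in the ROW_NUMBER ordering is an illustrative choice, not something from the thread:

```python
import sqlite3

# Hypothetical mock rows; group "a" has two rows tied on the maximum col4.
TIED_ROWS = [
    ("a", "x1", 30),
    ("a", "x2", 30),
    ("a", "x3", 10),
    ("b", "x4", 7),
]

# RANK gives every tied row the same rank, so both tied rows pass rk = 1.
RANK_SQL = """
SELECT col1, col2, col4
FROM (
    SELECT src.*,
           RANK() OVER (PARTITION BY col1 ORDER BY col4 DESC) AS rk
    FROM src
)
WHERE rk = 1
ORDER BY col1, col2
"""

# ROW_NUMBER numbers rows 1, 2, 3, ... so exactly one row per group passes.
# Ordering by col2 as well makes the choice among tied rows deterministic.
ROW_NUMBER_SQL = """
SELECT col1, col2, col4
FROM (
    SELECT src.*,
           ROW_NUMBER() OVER (
               PARTITION BY col1 ORDER BY col4 DESC, col2
           ) AS rk
    FROM src
)
WHERE rk = 1
ORDER BY col1, col2
"""


def query_mock(sql):
    """Load the tied mock rows into an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (col1 TEXT, col2 TEXT, col4 INTEGER)")
    conn.executemany("INSERT INTO src VALUES (?, ?, ?)", TIED_ROWS)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows


rank_top = query_mock(RANK_SQL)
row_number_top = query_mock(ROW_NUMBER_SQL)
print(len(rank_top), len(row_number_top))  # 3 2
```

The join on MAX(COL4) in the question also keeps every tied row, so RANK preserves that behavior, while ROW_NUMBER is the choice when exactly one row per COL1 is wanted.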