Refactor SQL query: CTE vs. subquery
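I am creating a dataset from an S3 bucket and am trying to improve the performance of my query. Both approaches I currently use work, but I would like to see a better query and learn how to improve my SQL skills. Sorry that there is no sample dataset: I have not figured out a practical way to provide mock data when pulling from .json files in S3.
Query #1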
WITH block_1 AS
(
SELECT
VALUE:COL1 AS COL1,
VALUE:COL2 AS COL2,
VALUE:COL3 AS COL3,
VALUE:COL4 AS COL4
from '@S3_BUCKET/',
lateral flatten( input => :value)), block_2 as
(
SELECT
VALUE:COL1 AS COL1,
max(VALUE:COL4) AS MaxCOL4
from '@S3_BUCKET/',
lateral flatten( input => :value)
group by COL1
)
select b.COL1 as COL1B, b.COLB as COL1B,
a.COL3, a.COL4 from block_1 as A
join block_2 b
on a.COL1 = b.COL1 and a.COL4 = b.MaxCOL4
;
Query #2, which I feel is an improvement, particularly because you do not need to specify the columns you want in the final SELECT statement (as I did above):
select a.* from
(
SELECT
VALUE:COL1 AS COL1,
VALUE:COL2 AS COL2,
VALUE:COL3 AS COL3,
VALUE:COL4 AS COL4
from '@S3_BUCKET/',
lateral flatten( input => :value))a
join
(
select COL1, MAX(COL4) COL4
from
(
SELECT
VALUE:COL1 AS COL1,
VALUE:COL2 AS COL2,
VALUE:COL3 AS COL3,
VALUE:COL4 AS COL4
from '@S3_BUCKET/',
lateral flatten( input => :value))
group by COL1) b
on a.COL1 = b.COL1 and a.COL4 = b.Col4;
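These two are my attempts so far. Is there a way to make this query better? Another route I considered was using "WHERE ... IN" with a list of COL1 values, but I would still have to hit S3 twice, as in the queries above.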
You should be able to use window functions, specifically RANK(), to simplify this query:
WITH block_1 AS (
SELECT
VALUE:COL1 AS COL1,
VALUE:COL2 AS COL2,
VALUE:COL3 AS COL3,
VALUE:COL4 AS COL4,
RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
FROM '@S3_BUCKET/',
lateral flatten( input => :value)
)
SELECT COL1, COL2, COL3, COL4
FROM block_1
WHERE rk = 1
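Thanks to Snowflake's QUALIFY clause, this can be simplified further. QUALIFY lets you filter on the alias of a window function, so it effectively acts as a HAVING clause for window functions: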
SELECT
VALUE:COL1 AS COL1,
VALUE:COL2 AS COL2,
VALUE:COL3 AS COL3,
VALUE:COL4 AS COL4,
RANK() OVER (PARTITION BY VALUE:COL1 ORDER BY VALUE:COL4 DESC) AS rk
FROM '@S3_BUCKET/',
lateral flatten( input => :value)
QUALIFY rk = 1
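Since the question has no sample data, here is a minimal sketch that checks the idea on mock rows. It uses Python's built-in sqlite3 module as a stand-in for Snowflake, so it cannot use QUALIFY or read from a stage. The table name `flattened` and all row values are hypothetical, and the sketch assumes the bundled SQLite is version 3.25 or newer (needed for window functions). It compares the join-on-MAX pattern from the question with the RANK() filter, which scans the data only once.

```python
import sqlite3

# Hypothetical mock rows standing in for the flattened JSON: (COL1, COL2, COL3, COL4).
ROWS = [
    ("a", "x1", "y1", 10),
    ("a", "x2", "y2", 30),
    ("b", "x3", "y3", 5),
    ("b", "x4", "y4", 7),
    ("c", "x5", "y5", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flattened (COL1 TEXT, COL2 TEXT, COL3 TEXT, COL4 INTEGER)"
)
conn.executemany("INSERT INTO flattened VALUES (?, ?, ?, ?)", ROWS)

# Pattern from the question: read the data twice and join on the per-group maximum.
JOIN_ON_MAX = """
SELECT a.COL1, a.COL2, a.COL3, a.COL4
FROM flattened AS a
JOIN (SELECT COL1, MAX(COL4) AS MaxCOL4 FROM flattened GROUP BY COL1) AS b
  ON a.COL1 = b.COL1 AND a.COL4 = b.MaxCOL4
ORDER BY a.COL1, a.COL2
"""

# Pattern from the answer: read the data once, rank within each COL1, keep rank 1.
RANK_FILTER = """
WITH block_1 AS (
    SELECT COL1, COL2, COL3, COL4,
           RANK() OVER (PARTITION BY COL1 ORDER BY COL4 DESC) AS rk
    FROM flattened
)
SELECT COL1, COL2, COL3, COL4
FROM block_1
WHERE rk = 1
ORDER BY COL1, COL2
"""

join_rows = conn.execute(JOIN_ON_MAX).fetchall()
rank_rows = conn.execute(RANK_FILTER).fetchall()

# Both patterns should return the same row per COL1 for this mock data.
print(join_rows == rank_rows)
```

On this mock data both queries keep the row with the highest COL4 for each COL1. The sketch says nothing about performance on a real stage; it only shows that the two formulations select the same rows here.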
@Nick. Using QUALIFY, this acts as a WHERE filter set to = 1. Also, replace RANK with ROW_NUMBER. Does that make sense?
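The difference between RANK and ROW_NUMBER only shows up when two rows in the same COL1 group share the maximum COL4. The sketch below illustrates this with Python's sqlite3 module as a stand-in for Snowflake (assuming SQLite 3.25 or newer). The table name `flattened`, the row values, and the COL2 tie-breaker in the ROW_NUMBER ordering are all hypothetical choices made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flattened (COL1 TEXT, COL2 TEXT, COL4 INTEGER)")
conn.executemany(
    "INSERT INTO flattened VALUES (?, ?, ?)",
    [
        ("a", "first", 30),   # ties with the next row on COL4
        ("a", "second", 30),
        ("a", "third", 10),
        ("b", "only", 7),
    ],
)

# RANK gives tied rows the same rank, so both tied rows survive the filter.
RANK_SQL = """
WITH ranked AS (
    SELECT COL1, COL2, COL4,
           RANK() OVER (PARTITION BY COL1 ORDER BY COL4 DESC) AS rk
    FROM flattened
)
SELECT COL1, COL2, COL4 FROM ranked WHERE rk = 1 ORDER BY COL1, COL2
"""

# ROW_NUMBER numbers every row distinctly, so exactly one row per COL1 survives.
# The extra COL2 sort key is an assumed tie-breaker to keep the pick deterministic.
ROW_NUMBER_SQL = """
WITH ranked AS (
    SELECT COL1, COL2, COL4,
           ROW_NUMBER() OVER (PARTITION BY COL1 ORDER BY COL4 DESC, COL2) AS rn
    FROM flattened
)
SELECT COL1, COL2, COL4 FROM ranked WHERE rn = 1 ORDER BY COL1, COL2
"""

rank_rows = conn.execute(RANK_SQL).fetchall()
row_number_rows = conn.execute(ROW_NUMBER_SQL).fetchall()

print(len(rank_rows), len(row_number_rows))
```

RANK matches the behaviour of the original join on MAX(COL4), which also returns every tied row. ROW_NUMBER is the choice if exactly one row per COL1 is wanted.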