Snowflake - Identify duplicate rows and flag them using an UPDATE statement
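I want to identify the duplicate rows in a table and add an error code to them. In every case I want to keep one row and flag all the others as duplicates. Unlike SQL Server, Snowflake does not support combining a CTE and an UPDATE statement in a single query. How can I implement this?
Table creation code: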
DROP TABLE IF EXISTS DUP_CODE_TEST;
CREATE TABLE DUP_CODE_TEST
AS (
SELECT '1' AS PARENT,'OWN' AS REL, '11' AS CHILD, 'ROW1' AS X, NULL AS ERR_CD
UNION ALL
SELECT '1', 'OWN' AS REL, '11' , 'ROW2' , NULL
UNION ALL
SELECT '1', 'OWN' AS REL, '11' , 'ROW3' , NULL
);
Source table:
+--------+-----+-------+------+--------+
| PARENT | REL | CHILD | X | ERR_CD |
+--------+-----+-------+------+--------+
| 1 | OWN | 11 | ROW1 | NULL |
| 1 | OWN | 11 | ROW2 | NULL |
| 1 | OWN | 11 | ROW3 | NULL |
+--------+-----+-------+------+--------+
In SQL Server I would do this:
WITH CTE_UPD
AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY PARENT,REL,CHILD ORDER BY X ) RN FROM DUP_CODE_TEST
)
UPDATE CTE_UPD
SET ERR_CD = 'AR-DUP'
WHERE RN > 1
The expected output is:
+--------+-----+-------+------+--------+
| PARENT | REL | CHILD | X | ERR_CD |
+--------+-----+-------+------+--------+
| 1 | OWN | 11 | ROW1 | NULL |
| 1 | OWN | 11 | ROW2 | AR-DUP |
| 1 | OWN | 11 | ROW3 | AR-DUP |
+--------+-----+-------+------+--------+
You can do something like this, assuming X is unique:
UPDATE DUP_CODE_TEST t
SET ERR_CD = 'AR-DUP'
FROM (SELECT PARENT, REL, CHILD, MIN(X) AS MIN_X  -- smallest X per group is the row to keep
      FROM DUP_CODE_TEST
      GROUP BY PARENT, REL, CHILD
     ) tt
WHERE t.PARENT = tt.PARENT AND t.REL = tt.REL AND
      t.CHILD = tt.CHILD AND t.X > tt.MIN_X;  -- flag every row after the first one
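That is, Snowflake does support joining to another table (or a subquery) in an UPDATE. The subquery aggregates the table to get the minimum X for each group, and that value is then used to decide which rows to update.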
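If you would rather keep the ROW_NUMBER() logic from the SQL Server version, the same join pattern should carry over: compute the row number in a subquery and join it back on the grouping columns plus X. This is only a sketch and still assumes X is unique within each group. The alias d and the RN > 1 filter are my own choices, not something from the question.

```sql
-- Sketch: flag every row except the first one in each (PARENT, REL, CHILD) group.
-- Assumes X is unique within a group, so each target row joins to exactly one row of d.
UPDATE DUP_CODE_TEST t
SET ERR_CD = 'AR-DUP'
FROM (
    SELECT PARENT, REL, CHILD, X,
           ROW_NUMBER() OVER (PARTITION BY PARENT, REL, CHILD ORDER BY X) AS RN
    FROM DUP_CODE_TEST
) d
WHERE t.PARENT = d.PARENT
  AND t.REL    = d.REL
  AND t.CHILD  = d.CHILD
  AND t.X      = d.X      -- join back to the exact row that was numbered
  AND d.RN > 1;           -- RN = 1 is the row to keep
```

On the sample table this should leave ROW1 untouched and set ERR_CD on ROW2 and ROW3. One caveat: if PARENT, REL, or CHILD can be NULL, the plain = comparisons will not match those rows, so you may need a NULL-safe comparison such as EQUAL_NULL there.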