Drop duplicate records in a table with no index using PostgreSQL
I have a table like the one shown below:
Subject_id subject_name Standard Rank Previous_subject_id
13 ABC 1st 1 21
13 ABC 1st 1 23
13 ABC 1st 1 13
25 def 3rd 6 42
25 def 3rd 6 25
25 def 3rd 6 28
25 XYZ 2nd 7 26
29 PQR 1st 1 31
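As you can see, all columns and values are the same except for the previous_subject_id column.
Rule 1
Delete all records that satisfy the condition subject_id = previous_subject_id.
Rule 2
If duplicates remain after rule 1, keep only the first (occurring) record for each duplicated subject_id, as in the sample output below.
I expect my output to look like this: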
Subject_id subject_name Standard Rank Previous_subject_id
13 ABC 1st 1 21
25 def 3rd 6 42
25 XYZ 2nd 7 26
29 PQR 1st 1 31
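The only problem is that my table has 285,000 records and is not indexed. Once the records are dropped, I will be able to set an index on subject_id, since the values become unique.
This is what I tried: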
select * from subject_class a
inner join
subject_class b
on a.subject_id = b.previous_subject_id
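However, the above query keeps running for a long time because of the missing index. Is there an efficient way to do this?
And how do I then drop the matching rows?
Can you help me with this?
I don't understand why you would use a JOIN when this looks straightforward:
DELETE FROM subject_class WHERE subject_id = previous_subject_id
Also, 285,000 rows isn't many, so performance should be fine. However, 285,000 * 285,000 (about 81 billion) is a big number, and that is basically what your query with the JOIN has to work through.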
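As a hedged sketch of how that single-pass delete might be run in PostgreSQL: each row is compared only with itself, so one sequential scan is enough and no index is needed. The table name subject_class comes from the question. The transaction wrapper and the preview count are additions for illustration, not part of the original answer.

```sql
-- Sketch only: assumes the asker's table is named subject_class.
BEGIN;

-- Preview how many rows rule 1 would remove (a single scan of the table).
SELECT count(*) AS rows_to_delete
FROM subject_class
WHERE subject_id = previous_subject_id;

-- Rule 1: delete rows whose previous_subject_id equals their own subject_id.
DELETE FROM subject_class
WHERE subject_id = previous_subject_id;

-- Inspect the table, then finish with COMMIT; to keep the change
-- or ROLLBACK; to undo it.
```

Leaving the final COMMIT or ROLLBACK as a manual step is deliberate here, so the delete can be checked before it becomes permanent.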
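OK, now we have a problem. In a relational database there is no concept of "first" or "last". Rows have no inherent order unless you give the database something to sort by. In your example you picked the two rows to keep purely by eye, based on the order in which they happened to appear when you listed them. That order is completely non-deterministic. It may well be the order in which the data was inserted into the heap (a non-indexed table), but that is almost impossible to reproduce and is outside the scope of this question.
What I can do is offer a deterministic way to remove rows. Because this is a bit more involved, I'll set up some test data first: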
DECLARE @subject_class TABLE (
subject_id INT,
subject_name VARCHAR(20),
[standard] VARCHAR(20),
[rank] INT,
previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
This is basically your setup: the data you listed, in a table with no index.
The first part is easy:
DELETE FROM @subject_class WHERE subject_id = previous_subject_id; --fixes 2 records
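Note that the examples in this answer are written in T-SQL (a table variable declared with DECLARE @..., and bracketed identifiers such as [rank]), which PostgreSQL does not accept. A minimal PostgreSQL sketch of the same setup and first step might look like the following. The temporary table and the double-quoted column names are assumptions made for illustration; quoting "standard" and "rank" is a precaution against keyword clashes, not something the question requires.

```sql
-- Sketch only: a session-local scratch copy of the question's sample data.
CREATE TEMP TABLE subject_class (
    subject_id          integer,
    subject_name        varchar(20),
    "standard"          varchar(20),
    "rank"              integer,
    previous_subject_id integer
);

INSERT INTO subject_class VALUES
    (13, 'ABC', '1st', 1, 21),
    (13, 'ABC', '1st', 1, 23),
    (13, 'ABC', '1st', 1, 13),
    (25, 'def', '3rd', 6, 42),
    (25, 'def', '3rd', 6, 25),
    (25, 'def', '3rd', 6, 28);

-- Rule 1: remove the rows that point at themselves (13/13 and 25/25).
DELETE FROM subject_class
WHERE subject_id = previous_subject_id;
```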
The second part is a little more complex, so I used a common table expression:
WITH cte AS (
SELECT
subject_id,
MIN(previous_subject_id) AS min_previous_subject_id
FROM
@subject_class
GROUP BY
subject_id)
DELETE
s
FROM
@subject_class s
INNER JOIN cte c ON c.subject_id = s.subject_id AND c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;
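This first determines the minimum previous_subject_id for each subject_id and assumes that this is the only value we want to keep. There are plenty of other ways to do this: you could pick the highest value, or come up with some more elaborate rule.
This doesn't give you exactly what you asked for. Instead, you get: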
subject_id subject_name standard rank previous_subject_id
13 ABC 1st 1 21
25 def 3rd 6 28
However, this is deterministic, in that you get the same results every time you run the query.
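PostgreSQL has no DELETE ... FROM ... INNER JOIN form, so the statement above would need rewriting there. One possible translation, offered as a sketch rather than a tested drop-in, uses DELETE ... USING with the same common table expression. It assumes the table is named subject_class, as in the question.

```sql
-- Sketch only: PostgreSQL form of the "keep the minimum" delete.
WITH cte AS (
    SELECT
        subject_id,
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM subject_class
    GROUP BY subject_id
)
DELETE FROM subject_class s
USING cte c
WHERE c.subject_id = s.subject_id
  -- Remove every row that does not carry the group's minimum value.
  AND c.min_previous_subject_id <> s.previous_subject_id;
```

Two caveats apply to this form: rows whose previous_subject_id is NULL are never deleted, because the comparison yields NULL, and two rows that share the same minimum value are both kept.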
You want the query to delete only rows that also match on the "other" fields, so here it is:
DECLARE @subject_class TABLE (
subject_id INT,
subject_name VARCHAR(20),
[standard] VARCHAR(20),
[rank] INT,
previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABF', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;
INSERT INTO @subject_class SELECT 25, 'dez', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
DELETE FROM @subject_class WHERE subject_id = previous_subject_id;
WITH cte AS (
SELECT
subject_id,
subject_name,
[standard],
[rank],
MIN(previous_subject_id) AS min_previous_subject_id
FROM
@subject_class
GROUP BY
subject_id,
subject_name,
[standard],
[rank])
DELETE
s
FROM
@subject_class s
INNER JOIN cte c ON c.subject_id = s.subject_id
AND c.subject_name = s.subject_name
AND c.[standard] = s.[standard]
AND c.[rank] = s.[rank]
WHERE
c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;
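This time we get 3 rows:
- the "dez" row is still deleted, because its subject_id and previous_subject_id are the same;
- the "ABF" row is kept, because it doesn't match on subject name.
This time, using your updated data: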
DECLARE @subject_class TABLE (
subject_id INT,
subject_name VARCHAR(20),
[standard] VARCHAR(20),
[rank] INT,
previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
INSERT INTO @subject_class SELECT 25, 'XYZ', '2nd', 7, 26;
INSERT INTO @subject_class SELECT 29, 'PQR', '1st', 1, 31;
DELETE FROM @subject_class WHERE subject_id = previous_subject_id;
WITH cte AS (
SELECT
subject_id,
subject_name,
[standard],
[rank],
MIN(previous_subject_id) AS min_previous_subject_id
FROM
@subject_class
GROUP BY
subject_id,
subject_name,
[standard],
[rank])
DELETE
s
FROM
@subject_class s
INNER JOIN cte c ON c.subject_id = s.subject_id
AND c.subject_name = s.subject_name
AND c.[standard] = s.[standard]
AND c.[rank] = s.[rank]
WHERE
c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;
I get the following results:
subject_id subject_name standard rank previous_subject_id
13 ABC 1st 1 21
25 def 3rd 6 28
25 XYZ 2nd 7 26
29 PQR 1st 1 31
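Does that match what you expected? Not quite, but that's because you are still relying on "first" when no such concept exists. I get the same number of rows and essentially the same results. I just picked different rows to keep than you did.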
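For PostgreSQL specifically, another way to express "keep one row per group" is a window function combined with the system column ctid, which identifies a row's physical location. This is a sketch under assumptions, not the method used above: the table is assumed to be subject_class, rule 1 is assumed to have been applied already, and the alias row_ctid and the index name are made up for illustration. Ordering by previous_subject_id keeps the same rows as the MIN approach, and ctid only breaks ties between exact duplicates; it is not a reliable record of insertion order.

```sql
-- Sketch only: number the rows within each group of otherwise identical
-- columns, then delete everything except row number 1.
DELETE FROM subject_class s
USING (
    SELECT
        ctid AS row_ctid,
        ROW_NUMBER() OVER (
            PARTITION BY subject_id, subject_name, "standard", "rank"
            ORDER BY previous_subject_id, ctid
        ) AS rn
    FROM subject_class
) d
WHERE s.ctid = d.row_ctid
  AND d.rn > 1;

-- Afterwards an ordinary index can be added. A UNIQUE index on subject_id
-- alone would fail on the sample data, because subject_id 25 still appears
-- twice (once for 'def' and once for 'XYZ').
CREATE INDEX subject_class_subject_id_idx ON subject_class (subject_id);
```

Unlike the join on the minimum value, this form also removes exact duplicates, since every row has a distinct ctid.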