Drop duplicate records in a table with no index using PostgreSQL

I have a table like the one shown below

Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
13          ABC           1st       1     23
13          ABC           1st       1     13
25          def           3rd       6     42
25          def           3rd       6     25
25          def           3rd       6     28
25          XYZ           2nd       7     26
29          PQR           1st       1     31

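As you can see, all columns and values are the same except for the previous_subject_id column (with one row as an exception).

Rule 1

Delete all records that satisfy the condition subject_id = previous_subject_id.

Rule 2

If duplicate subject_ids still remain after rule 1, keep only the first (occurring) record.

As you can see in the sample output below, I have kept only the first occurring record.

I would like the output to look like this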

Subject_id  subject_name  Standard  Rank  Previous_subject_id
13          ABC           1st       1     21
25          def           3rd       6     42
25          XYZ           2nd       7     26
29          PQR           1st       1     31

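The only problem is that my table has 285,000 records and is not indexed. After deleting the records, I will be able to set an index on subject_id, since the values become unique.

Here is what I tried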

select * from subject_class a
inner join 
subject_class b
on a.subject_id = b.previous_subject_id

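However, the query above keeps running for a long time because of the missing index. Is there a more efficient way to do this?

And how do I then drop the rows?

Can you help me with this?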

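I don't see why you would use a JOIN at all, when this looks as simple as:

DELETE FROM subject_class WHERE subject_id = previous_subject_id;

Also, 285,000 rows is not a lot, so performance should be fine. However, 285,000 * 285,000 (about 81 billion) is a large number, and in the worst case, with no index to help, that is roughly how many row pairs your JOIN query has to work through.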
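To make the comparison concrete, here is a minimal sketch that runs the same single-pass DELETE on the sample rows from the question. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for the real PostgreSQL table (that substitution is an assumption made so the example is runnable); the DELETE statement itself is plain SQL that needs neither a join nor an index.

```python
import sqlite3

# Sample rows from the question:
# (subject_id, subject_name, standard, rank, previous_subject_id)
ROWS = [
    (13, "ABC", "1st", 1, 21),
    (13, "ABC", "1st", 1, 23),
    (13, "ABC", "1st", 1, 13),
    (25, "def", "3rd", 6, 42),
    (25, "def", "3rd", 6, 25),
    (25, "def", "3rd", 6, 28),
    (25, "XYZ", "2nd", 7, 26),
    (29, "PQR", "1st", 1, 31),
]

# In-memory SQLite database: an assumed stand-in for the PostgreSQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE subject_class (subject_id INTEGER, subject_name TEXT, '
    'standard TEXT, "rank" INTEGER, previous_subject_id INTEGER)'
)
conn.executemany("INSERT INTO subject_class VALUES (?, ?, ?, ?, ?)", ROWS)

# Rule 1 compares two columns of the same row, so one scan is enough.
cur = conn.execute(
    "DELETE FROM subject_class WHERE subject_id = previous_subject_id"
)
deleted = cur.rowcount
remaining = conn.execute("SELECT COUNT(*) FROM subject_class").fetchone()[0]

# Worst-case number of row pairs an unindexed self-join could compare.
worst_case_pairs = 285_000 * 285_000

print(deleted, remaining, worst_case_pairs)
```

On the eight sample rows this removes the two rows where subject_id equals previous_subject_id and leaves six, touching each row once.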


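OK, now we have a problem. In a relational database there is no concept of "first" or "last". Rows have no inherent order unless you tell the database what to order them by. In your example, you have visually picked two rows to keep from the list, purely based on the order in which they appeared when you listed them. That order, however, is not deterministic. It may well be the order in which the data was inserted into the heap (an unindexed table), but that is almost impossible to reproduce reliably and is beyond the scope of this question.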
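As an illustration of what "telling the database what to order by" could look like, the sketch below assumes the load order can be recorded in an explicit column. The column name load_order is hypothetical and does not exist in the question's table; SQLite (via Python's sqlite3) again stands in for PostgreSQL. With such a column, "first" becomes a well-defined notion: the row with the smallest load_order in each group.

```python
import sqlite3

# Sample rows from the question, in the order they were listed:
# (subject_id, subject_name, standard, rank, previous_subject_id)
ROWS = [
    (13, "ABC", "1st", 1, 21),
    (13, "ABC", "1st", 1, 23),
    (13, "ABC", "1st", 1, 13),
    (25, "def", "3rd", 6, 42),
    (25, "def", "3rd", 6, 25),
    (25, "def", "3rd", 6, 28),
    (25, "XYZ", "2nd", 7, 26),
    (29, "PQR", "1st", 1, 31),
]

conn = sqlite3.connect(":memory:")
# load_order is a hypothetical extra column that records insertion sequence.
conn.execute(
    'CREATE TABLE subject_class (load_order INTEGER, subject_id INTEGER, '
    'subject_name TEXT, standard TEXT, "rank" INTEGER, '
    'previous_subject_id INTEGER)'
)
conn.executemany(
    "INSERT INTO subject_class VALUES (?, ?, ?, ?, ?, ?)",
    [(i, *row) for i, row in enumerate(ROWS, start=1)],
)

# Rule 1: drop rows that point at themselves.
conn.execute("DELETE FROM subject_class WHERE subject_id = previous_subject_id")

# Rule 2: within each group of otherwise identical rows, keep only the
# row with the smallest load_order, i.e. the explicitly defined "first".
conn.execute(
    'DELETE FROM subject_class WHERE load_order NOT IN ('
    ' SELECT MIN(load_order) FROM subject_class'
    ' GROUP BY subject_id, subject_name, standard, "rank")'
)

kept = conn.execute(
    "SELECT subject_id, subject_name, previous_subject_id "
    "FROM subject_class ORDER BY load_order"
).fetchall()
print(kept)
```

Under that assumption the surviving rows are the ones the question lists as the desired output. Without an ordering column of this kind, nothing in the table itself pins down which row came "first".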

What I can do is offer a deterministic way to delete the rows. Because this is a bit more involved, I'll set up some test data first:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;

That is essentially your setup: the data you listed, in a table with no indexes.

The first part is easy:

DELETE FROM @subject_class WHERE subject_id = previous_subject_id; --fixes 2 records

The second part is a bit more complex, so I used a common table expression:

WITH cte AS (
    SELECT
        subject_id,
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id)
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id AND c.min_previous_subject_id != s.previous_subject_id;
SELECT * FROM @subject_class;

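This first determines the minimum previous_subject_id for each subject_id and assumes that this is the only value we want to keep. There are plenty of other ways to do this: you could pick the highest value instead, or come up with a more elaborate rule.

This doesn't give you exactly what you asked for; instead, the result you get is: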

subject_id  subject_name    standard    rank    previous_subject_id
13          ABC             1st         1       21
25          def             3rd         6       28

It is deterministic, though: you get the same result every time you run the query.
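The scripts above use SQL Server syntax (table variables and DELETE with a JOIN), which PostgreSQL does not accept as written. As a rough sketch of the same idea in more portable SQL, the example below expresses the "keep the minimum" rule with a correlated subquery and also shows the "keep the highest value" alternative. It runs against SQLite through Python's sqlite3 as an assumed stand-in; the helper name dedupe and its keep parameter are made up for this illustration.

```python
import sqlite3

# The six test rows used above:
# (subject_id, subject_name, standard, rank, previous_subject_id)
ROWS = [
    (13, "ABC", "1st", 1, 21),
    (13, "ABC", "1st", 1, 23),
    (13, "ABC", "1st", 1, 13),
    (25, "def", "3rd", 6, 42),
    (25, "def", "3rd", 6, 25),
    (25, "def", "3rd", 6, 28),
]


def dedupe(keep):
    """Apply both rules, keeping the MIN or MAX previous_subject_id per subject_id."""
    if keep not in ("MIN", "MAX"):  # whitelist, since keep is spliced into the SQL
        raise ValueError("keep must be 'MIN' or 'MAX'")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE subject_class (subject_id INTEGER, subject_name TEXT, '
        'standard TEXT, "rank" INTEGER, previous_subject_id INTEGER)'
    )
    conn.executemany("INSERT INTO subject_class VALUES (?, ?, ?, ?, ?)", ROWS)
    # Rule 1: drop rows where subject_id equals previous_subject_id.
    conn.execute("DELETE FROM subject_class WHERE subject_id = previous_subject_id")
    # Rule 2: drop every row whose previous_subject_id is not the chosen
    # extreme (minimum or maximum) for its subject_id.
    conn.execute(
        "DELETE FROM subject_class WHERE previous_subject_id <> ("
        f" SELECT {keep}(o.previous_subject_id) FROM subject_class AS o"
        " WHERE o.subject_id = subject_class.subject_id)"
    )
    rows = conn.execute(
        "SELECT subject_id, previous_subject_id "
        "FROM subject_class ORDER BY subject_id"
    ).fetchall()
    conn.close()
    return rows


print(dedupe("MIN"))
print(dedupe("MAX"))
```

Either choice is repeatable from run to run; the two calls simply keep different rows, which is the point about having to pick a rule.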


You wanted the query to delete only rows that match on the "other" fields, so here it is:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABF', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'dez', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;

DELETE FROM @subject_class WHERE subject_id = previous_subject_id;

WITH cte AS (
    SELECT
        subject_id,
        subject_name,
        [standard],
        [rank],
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id,
        subject_name,
        [standard],
        [rank])
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id 
        AND c.subject_name = s.subject_name 
        AND c.[standard] = s.[standard]
        AND c.[rank] = s.[rank]
WHERE
    c.min_previous_subject_id != s.previous_subject_id;

SELECT * FROM @subject_class;

This time we get 3 rows:
- the "dez" row is still deleted, because its subject_id and previous_subject_id are the same;
- the "ABF" row is kept, because its subject name does not match.


This time using your updated data:

DECLARE @subject_class TABLE (
    subject_id INT,
    subject_name VARCHAR(20),
    [standard] VARCHAR(20),
    [rank] INT,
    previous_subject_id INT);
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 21;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 23;
INSERT INTO @subject_class SELECT 13, 'ABC', '1st', 1, 13;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 42;  
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 25;
INSERT INTO @subject_class SELECT 25, 'def', '3rd', 6, 28;
INSERT INTO @subject_class SELECT 25, 'XYZ', '2nd', 7, 26;
INSERT INTO @subject_class SELECT 29, 'PQR', '1st', 1, 31;

DELETE FROM @subject_class WHERE subject_id = previous_subject_id;

WITH cte AS (
    SELECT
        subject_id,
        subject_name,
        [standard],
        [rank],
        MIN(previous_subject_id) AS min_previous_subject_id
    FROM
        @subject_class
    GROUP BY
        subject_id,
        subject_name,
        [standard],
        [rank])
DELETE
    s
FROM
    @subject_class s
    INNER JOIN cte c ON c.subject_id = s.subject_id 
        AND c.subject_name = s.subject_name 
        AND c.[standard] = s.[standard]
        AND c.[rank] = s.[rank]
WHERE
    c.min_previous_subject_id != s.previous_subject_id;

SELECT * FROM @subject_class;

I get the following results:

subject_id  subject_name    standard    rank    previous_subject_id
13          ABC             1st         1       21
25          def             3rd         6       28
25          XYZ             2nd         7       26
29          PQR             1st         1       31

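Does that match what you expected? Not quite, but that is because you are still relying on "first" when no such concept exists. I get the same number of rows, and the results are essentially the same. I simply chose different rows to keep than you did.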
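For completeness, here is a hedged sketch of the final version (matching on all the "other" columns) written with a correlated EXISTS, which should carry over to PostgreSQL more directly than the DELETE ... JOIN form. It is run here against SQLite via Python's sqlite3 as an assumed stand-in, using the updated eight rows. It also checks which subject_id values still occur more than once afterwards, since the question plans to treat subject_id as unique.

```python
import sqlite3

# Updated sample data:
# (subject_id, subject_name, standard, rank, previous_subject_id)
ROWS = [
    (13, "ABC", "1st", 1, 21),
    (13, "ABC", "1st", 1, 23),
    (13, "ABC", "1st", 1, 13),
    (25, "def", "3rd", 6, 42),
    (25, "def", "3rd", 6, 25),
    (25, "def", "3rd", 6, 28),
    (25, "XYZ", "2nd", 7, 26),
    (29, "PQR", "1st", 1, 31),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE subject_class (subject_id INTEGER, subject_name TEXT, '
    'standard TEXT, "rank" INTEGER, previous_subject_id INTEGER)'
)
conn.executemany("INSERT INTO subject_class VALUES (?, ?, ?, ?, ?)", ROWS)

# Rule 1: drop rows where subject_id equals previous_subject_id.
conn.execute("DELETE FROM subject_class WHERE subject_id = previous_subject_id")

# Rule 2: drop a row if another row matches it on every other column and
# has a smaller previous_subject_id. Rows that are identical in all five
# columns, or that contain NULLs, are not handled by this sketch.
conn.execute(
    'DELETE FROM subject_class WHERE EXISTS ('
    ' SELECT 1 FROM subject_class AS o'
    ' WHERE o.subject_id = subject_class.subject_id'
    ' AND o.subject_name = subject_class.subject_name'
    ' AND o.standard = subject_class.standard'
    ' AND o."rank" = subject_class."rank"'
    ' AND o.previous_subject_id < subject_class.previous_subject_id)'
)

result = conn.execute(
    "SELECT subject_id, subject_name, previous_subject_id "
    "FROM subject_class ORDER BY subject_id, previous_subject_id"
).fetchall()

# subject_id values that still appear more than once after both rules.
still_duplicated = [
    row[0]
    for row in conn.execute(
        "SELECT subject_id FROM subject_class "
        "GROUP BY subject_id HAVING COUNT(*) > 1"
    )
]

print(result)
print(still_duplicated)
```

The four surviving rows are the same ones listed above. Note that subject_id 25 still appears twice (once for "def", once for "XYZ"), so a unique index on subject_id alone would not be possible on this sample without a further rule.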