PostgreSQL Query 10x Slower After Migration from EC2 to RDS

Question

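We migrated our PostgreSQL database from EC2 (running version 12.1 on Docker) to RDS (version 12.6). Afterwards we noticed that some queries had become very slow (10x slower).

Here is one of our queries: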

SELECT cp."FirstName" ,
       cp."LastName" ,
       cp."DealerGroupId" ,
       count(*) AS "DuplicateCount"
FROM "sc_CustomerProfiles" cp
WHERE (cp."FirstName" IS NOT NULL
       OR cp."LastName" IS NOT NULL)
  AND cp."UpdatedDate" > '2020-07-01'
  AND EXISTS
    (SELECT 1
     FROM "sc_CustomerProfiles" scp
     WHERE scp."FirstName" = cp."FirstName"
       AND cp."LastName" = scp."LastName"
       AND cp."DealerGroupId" = scp."DealerGroupId"
       AND scp."ProfileId" < 0 )
GROUP BY cp."FirstName" ,
         cp."LastName" ,
         cp."DealerGroupId"
HAVING count(*) > 1

Running EXPLAIN ANALYZE on our old database on EC2 gives the following result:

Finalize GroupAggregate  (cost=818304.54..922603.67 rows=196075 width=61) (actual time=1679.259..1931.629 rows=623 loops=1)
  Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
  Filter: (count(*) > 1)
  Rows Removed by Filter: 2257
  ->  Gather Merge  (cost=818304.54..906894.61 rows=668500 width=61) (actual time=1678.763..1934.877 rows=3290 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=817304.52..828733.10 rows=334250 width=61) (actual time=1637.652..1886.456 rows=1097 loops=3)
              Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
              ->  Merge Semi Join  (cost=817304.52..822048.10 rows=334250 width=53) (actual time=1637.597..1886.015 rows=1212 loops=3)
                    Merge Cond: (((cp."FirstName")::text = (scp."FirstName")::text) AND ((cp."LastName")::text = (scp."LastName")::text) AND (cp."DealerGroupId" = scp."DealerGroupId"))
                    ->  Sort  (cost=564987.54..565957.09 rows=387821 width=53) (actual time=1632.503..1841.309 rows=284808 loops=3)
                          Sort Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
                          Sort Method: external merge  Disk: 18248kB
                          Worker 0:  Sort Method: external merge  Disk: 18720kB
                          Worker 1:  Sort Method: external merge  Disk: 18720kB
                          ->  Parallel Seq Scan on "sc_CustomerProfiles" cp  (cost=0.00..515729.99 rows=387821 width=53) (actual time=575.396..1171.259 rows=284808 loops=3)
                                Filter: ((("FirstName" IS NOT NULL) OR ("LastName" IS NOT NULL)) AND ("UpdatedDate" > '2020-07-01 00:00:00+07'::timestamp with time zone))
                                Rows Removed by Filter: 2613490
                    ->  Sort  (cost=252316.98..252533.20 rows=86489 width=53) (actual time=4.940..5.162 rows=2937 loops=3)
                          Sort Key: scp."FirstName", scp."LastName", scp."DealerGroupId"
                          Sort Method: quicksort  Memory: 440kB
                          Worker 0:  Sort Method: quicksort  Memory: 440kB
                          Worker 1:  Sort Method: quicksort  Memory: 440kB
                          ->  Index Scan using "sc_CustomerProfiles_ProfileId" on "sc_CustomerProfiles" scp  (cost=0.43..242267.28 rows=86489 width=53) (actual time=0.018..1.700 rows=3055 loops=3)
                                Index Cond: ("ProfileId" < 0)
Planning Time: 1.337 ms
JIT:
  Functions: 79
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 16.630 ms, Inlining 216.395 ms, Optimization 990.256 ms, Emission 518.330 ms, Total 1741.611 ms
Execution Time: 1992.259 ms

Running it on our new database on RDS gives this result:

Finalize GroupAggregate  (cost=744995.34..848665.34 rows=195480 width=61) (actual time=144257.571..194501.899 rows=621 loops=1)
  Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
  Filter: (count(*) > 1)
  Rows Removed by Filter: 2261
  ->  Gather Merge  (cost=744995.34..833031.15 rows=664296 width=61) (actual time=144214.280..194498.590 rows=3190 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial GroupAggregate  (cost=743995.31..755354.88 rows=332148 width=61) (actual time=139429.298..187940.480 rows=1063 loops=3)
              Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
              ->  Merge Semi Join  (cost=743995.31..748711.92 rows=332148 width=53) (actual time=139405.473..187938.320 rows=1212 loops=3)
                    Merge Cond: (((cp."FirstName")::text = (scp."FirstName")::text) AND ((cp."LastName")::text = (scp."LastName")::text) AND (cp."DealerGroupId" = scp."DealerGroupId"))
                    ->  Sort  (cost=493373.41..494335.55 rows=384857 width=53) (actual time=138424.282..182706.702 rows=285254 loops=3)
                          Sort Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
                          Sort Method: external merge  Disk: 19464kB
                          Worker 0:  Sort Method: external merge  Disk: 17672kB
                          Worker 1:  Sort Method: external merge  Disk: 18616kB
                          ->  Parallel Seq Scan on "sc_CustomerProfiles" cp  (cost=0.00..444513.80 rows=384857 width=53) (actual time=0.048..1509.801 rows=285255 loops=3)
                                Filter: ((("FirstName" IS NOT NULL) OR ("LastName" IS NOT NULL)) AND ("UpdatedDate" > '2020-07-01 00:00:00+07'::timestamp with time zone))
                                Rows Removed by Filter: 2613405
                    ->  Sort  (cost=250621.90..250838.81 rows=86762 width=53) (actual time=977.557..978.400 rows=2940 loops=3)
                          Sort Key: scp."FirstName", scp."LastName", scp."DealerGroupId"
                          Sort Method: quicksort  Memory: 441kB
                          Worker 0:  Sort Method: quicksort  Memory: 441kB
                          Worker 1:  Sort Method: quicksort  Memory: 441kB
                          ->  Index Scan using "sc_CustomerProfiles_ProfileId" on "sc_CustomerProfiles" scp  (cost=0.43..240537.35 rows=86762 width=53) (actual time=0.079..3.373 rows=3057 loops=3)
                                Index Cond: ("ProfileId" < 0)
Planning Time: 31.569 ms
Execution Time: 194505.100 ms

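I noticed that the 'scan' steps take a similar amount of time, but the 'sort' steps are far apart: the larger Sort finishes in about 1.8 seconds on EC2 and about 183 seconds on RDS. The instance specs are not identical, but I don't see high CPU or memory utilization. What could be the root cause, or how should I investigate this?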

Solution

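It turned out that the root cause was a bug in the glibc collation functions that affects only Thai (https://sourceware.org/bugzilla/show_bug.cgi?id=18441). It worked on EC2 because we were using the postgres alpine Docker image, which uses musl instead of glibc.

Since we don't actually need to sort the columns in Thai, changing LC_COLLATE to 'C' does solve the problem. Note that we could also use ICU collations if needed.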
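
For anyone who wants to check whether they are hitting the same thing, here is a rough sketch of how the collation could be inspected and the sort timed with and without it. The test query is a simplified stand-in for the one in the question, and the collation name thai_icu is a made-up name for illustration. The ICU collation assumes a server built with ICU support.

```sql
-- Show which collation the current database was created with.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Time the sort under the database's default collation ...
EXPLAIN (ANALYZE)
SELECT "FirstName", "LastName"
FROM "sc_CustomerProfiles"
WHERE "UpdatedDate" > '2020-07-01'
ORDER BY "FirstName", "LastName";

-- ... and under the byte-order "C" collation for comparison.
EXPLAIN (ANALYZE)
SELECT "FirstName", "LastName"
FROM "sc_CustomerProfiles"
WHERE "UpdatedDate" > '2020-07-01'
ORDER BY "FirstName" COLLATE "C", "LastName" COLLATE "C";

-- Optional: an ICU-based Thai collation, if linguistic ordering is needed.
-- "thai_icu" is a hypothetical name chosen for this example.
CREATE COLLATION thai_icu (provider = icu, locale = 'th');
```

If the second EXPLAIN is dramatically faster than the first, the collation is the likely culprit. Note that in PostgreSQL 12 the database-wide LC_COLLATE is fixed when the database is created, so changing it generally means creating a new database (from template0) and reloading the data, or overriding the collation per column or per query as above.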

Answer 1

The difference may be the I/O speed when writing and reading the temporary files needed to sort the roughly 300,000 rows.

To be certain, set the parameter track_io_timing to on and use EXPLAIN (ANALYZE, BUFFERS). This will also show you the I/O times.
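
As a sketch, that could look like the following, using the query from the question. Setting track_io_timing per session needs elevated privileges, so on RDS it may have to be enabled through the DB parameter group instead. That part is an assumption about your setup.

```sql
-- Enable I/O timing for this session (requires sufficient privileges).
SET track_io_timing = on;

-- BUFFERS adds shared/temp block counts (and I/O timings) to each plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT cp."FirstName",
       cp."LastName",
       cp."DealerGroupId",
       count(*) AS "DuplicateCount"
FROM "sc_CustomerProfiles" cp
WHERE (cp."FirstName" IS NOT NULL
       OR cp."LastName" IS NOT NULL)
  AND cp."UpdatedDate" > '2020-07-01'
  AND EXISTS
    (SELECT 1
     FROM "sc_CustomerProfiles" scp
     WHERE scp."FirstName" = cp."FirstName"
       AND cp."LastName" = scp."LastName"
       AND cp."DealerGroupId" = scp."DealerGroupId"
       AND scp."ProfileId" < 0)
GROUP BY cp."FirstName",
         cp."LastName",
         cp."DealerGroupId"
HAVING count(*) > 1;
```

In the output, look at the "temp read" and "temp written" figures on the Sort nodes to see how much data goes through temporary files.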

To avoid the temporary files, you can increase work_mem (try around 100MB), which should improve performance significantly.
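
A minimal way to try this without touching the server configuration is to change work_mem for the current session only. The test query below is a simplified stand-in for the sort in the question, not the original query.

```sql
-- Check the current value, then raise it for this session only.
SHOW work_mem;
SET work_mem = '100MB';

-- Re-run a sort over the same rows and inspect the Sort node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "FirstName", "LastName", "DealerGroupId"
FROM "sc_CustomerProfiles"
WHERE "UpdatedDate" > '2020-07-01'
ORDER BY "FirstName", "LastName", "DealerGroupId";

-- Go back to the configured default afterwards.
RESET work_mem;
```

If the setting is large enough, the Sort node should report an in-memory method such as "quicksort  Memory: ..." rather than "external merge  Disk: ...". Keep in mind that work_mem applies per sort operation and per parallel worker, so a high global value can add up quickly.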

This index might also help:

CREATE INDEX ON "sc_CustomerProfiles" ("UpdatedDate")
   WHERE "FirstName" IS NOT NULL OR "LastName" IS NOT NULL;

(If "FirstName" and "LastName" are usually not NULL, you can omit the WHERE clause.)



Tags: sql, postgresql, amazon-web-services, amazon-rds
