Postgresql Query 10x Slower After Migration from EC2 to RDS
We migrated our PostgreSQL database from EC2 (12.1, running on Docker) to RDS (12.6), and then noticed that some queries became much slower (about 10x slower).
Here is one of our queries:
SELECT cp."FirstName" ,
cp."LastName" ,
cp."DealerGroupId" ,
count(*) AS "DuplicateCount"
FROM "sc_CustomerProfiles" cp
WHERE (cp."FirstName" IS NOT NULL
OR cp."LastName" IS NOT NULL)
AND cp."UpdatedDate" > '2020-07-01'
AND EXISTS
(SELECT 1
FROM "sc_CustomerProfiles" scp
WHERE scp."FirstName" = cp."FirstName"
AND cp."LastName" = scp."LastName"
AND cp."DealerGroupId" = scp."DealerGroupId"
AND scp."ProfileId" < 0 )
GROUP BY cp."FirstName" ,
cp."LastName" ,
cp."DealerGroupId"
HAVING count(*) > 1
Running EXPLAIN ANALYZE on the old database on EC2 gives the following result:
Finalize GroupAggregate (cost=818304.54..922603.67 rows=196075 width=61) (actual time=1679.259..1931.629 rows=623 loops=1)
Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
Filter: (count(*) > 1)
Rows Removed by Filter: 2257
-> Gather Merge (cost=818304.54..906894.61 rows=668500 width=61) (actual time=1678.763..1934.877 rows=3290 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=817304.52..828733.10 rows=334250 width=61) (actual time=1637.652..1886.456 rows=1097 loops=3)
Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
-> Merge Semi Join (cost=817304.52..822048.10 rows=334250 width=53) (actual time=1637.597..1886.015 rows=1212 loops=3)
Merge Cond: (((cp."FirstName")::text = (scp."FirstName")::text) AND ((cp."LastName")::text = (scp."LastName")::text) AND (cp."DealerGroupId" = scp."DealerGroupId"))
-> Sort (cost=564987.54..565957.09 rows=387821 width=53) (actual time=1632.503..1841.309 rows=284808 loops=3)
Sort Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
Sort Method: external merge Disk: 18248kB
Worker 0: Sort Method: external merge Disk: 18720kB
Worker 1: Sort Method: external merge Disk: 18720kB
-> Parallel Seq Scan on "sc_CustomerProfiles" cp (cost=0.00..515729.99 rows=387821 width=53) (actual time=575.396..1171.259 rows=284808 loops=3)
Filter: ((("FirstName" IS NOT NULL) OR ("LastName" IS NOT NULL)) AND ("UpdatedDate" > '2020-07-01 00:00:00+07'::timestamp with time zone))
Rows Removed by Filter: 2613490
-> Sort (cost=252316.98..252533.20 rows=86489 width=53) (actual time=4.940..5.162 rows=2937 loops=3)
Sort Key: scp."FirstName", scp."LastName", scp."DealerGroupId"
Sort Method: quicksort Memory: 440kB
Worker 0: Sort Method: quicksort Memory: 440kB
Worker 1: Sort Method: quicksort Memory: 440kB
-> Index Scan using "sc_CustomerProfiles_ProfileId" on "sc_CustomerProfiles" scp (cost=0.43..242267.28 rows=86489 width=53) (actual time=0.018..1.700 rows=3055 loops=3)
Index Cond: ("ProfileId" < 0)
Planning Time: 1.337 ms
JIT:
Functions: 79
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 16.630 ms, Inlining 216.395 ms, Optimization 990.256 ms, Emission 518.330 ms, Total 1741.611 ms
Execution Time: 1992.259 ms
While running it on the new database on RDS gives this result:
Finalize GroupAggregate (cost=744995.34..848665.34 rows=195480 width=61) (actual time=144257.571..194501.899 rows=621 loops=1)
Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
Filter: (count(*) > 1)
Rows Removed by Filter: 2261
-> Gather Merge (cost=744995.34..833031.15 rows=664296 width=61) (actual time=144214.280..194498.590 rows=3190 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=743995.31..755354.88 rows=332148 width=61) (actual time=139429.298..187940.480 rows=1063 loops=3)
Group Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
-> Merge Semi Join (cost=743995.31..748711.92 rows=332148 width=53) (actual time=139405.473..187938.320 rows=1212 loops=3)
Merge Cond: (((cp."FirstName")::text = (scp."FirstName")::text) AND ((cp."LastName")::text = (scp."LastName")::text) AND (cp."DealerGroupId" = scp."DealerGroupId"))
-> Sort (cost=493373.41..494335.55 rows=384857 width=53) (actual time=138424.282..182706.702 rows=285254 loops=3)
Sort Key: cp."FirstName", cp."LastName", cp."DealerGroupId"
Sort Method: external merge Disk: 19464kB
Worker 0: Sort Method: external merge Disk: 17672kB
Worker 1: Sort Method: external merge Disk: 18616kB
-> Parallel Seq Scan on "sc_CustomerProfiles" cp (cost=0.00..444513.80 rows=384857 width=53) (actual time=0.048..1509.801 rows=285255 loops=3)
Filter: ((("FirstName" IS NOT NULL) OR ("LastName" IS NOT NULL)) AND ("UpdatedDate" > '2020-07-01 00:00:00+07'::timestamp with time zone))
Rows Removed by Filter: 2613405
-> Sort (cost=250621.90..250838.81 rows=86762 width=53) (actual time=977.557..978.400 rows=2940 loops=3)
Sort Key: scp."FirstName", scp."LastName", scp."DealerGroupId"
Sort Method: quicksort Memory: 441kB
Worker 0: Sort Method: quicksort Memory: 441kB
Worker 1: Sort Method: quicksort Memory: 441kB
-> Index Scan using "sc_CustomerProfiles_ProfileId" on "sc_CustomerProfiles" scp (cost=0.43..240537.35 rows=86762 width=53) (actual time=0.079..3.373 rows=3057 loops=3)
Index Cond: ("ProfileId" < 0)
Planning Time: 31.569 ms
Execution Time: 194505.100 ms
I noticed that the durations of the 'scan' parts are similar, but the 'sort' parts are far apart. The instance specs are not identical, but I don't see high CPU or memory utilization. What could be the root cause here, and how should I investigate this?
Solution
It turned out that the root cause was a Thai-only bug in glibc's collation functions (https://sourceware.org/bugzilla/show_bug.cgi?id=18441). The reason this worked on EC2 was that we were using the postgres alpine Docker image, which uses musl instead of glibc.
Since we don't actually need to sort these columns in Thai, changing LC_COLLATE to 'C' did fix the problem. Note that we could also have used an ICU collation if needed.
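A minimal sketch of the two fixes mentioned above (the new database name, the ICU collation name, and the column type are illustrative; dump/restore and ICU availability are assumed):

```sql
-- Inspect the collation each database currently uses
SELECT datname, datcollate, datctype FROM pg_database;

-- Option 1: recreate the database with the C collation.
-- LC_COLLATE cannot be changed on an existing database, so this
-- requires a dump/restore into the newly created database.
CREATE DATABASE customer_db_c
    TEMPLATE template0
    LC_COLLATE 'C'
    LC_CTYPE 'C';

-- Option 2: keep the database locale but switch only the affected
-- columns to an ICU collation (requires PostgreSQL built with ICU)
CREATE COLLATION thai_icu (provider = icu, locale = 'th-TH');
ALTER TABLE "sc_CustomerProfiles"
    ALTER COLUMN "FirstName" TYPE text COLLATE thai_icu;
```

Note that changing a column's collation (option 2) invalidates indexes on that column, so they need to be reindexed afterwards.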
The difference is probably the I/O speed when writing and reading the temporary files needed to sort the 300000 rows.
To be sure, set the parameter track_io_timing to on and use EXPLAIN (ANALYZE, BUFFERS). That will also show you the I/O times.
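For example (the SET applies only to the current session):

```sql
-- enable timing of block reads and writes for this session
SET track_io_timing = on;

-- BUFFERS reports shared/temp blocks touched per plan node;
-- with track_io_timing on, "I/O Timings" lines appear as well
EXPLAIN (ANALYZE, BUFFERS)
SELECT ...;  -- the query above
```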
To avoid the temporary files, you can increase work_mem (try something like 100MB), which should improve performance considerably.
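For instance, to try it per session before changing anything globally (the role name is hypothetical):

```sql
-- each sort or hash operation (per worker) may use up to work_mem,
-- so raise it for a session or role rather than cluster-wide
SET work_mem = '100MB';

-- or persist it for a specific role:
ALTER ROLE reporting_user SET work_mem = '100MB';
```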
This index might also be beneficial:
CREATE INDEX ON "sc_CustomerProfiles" ("UpdatedDate")
WHERE "FirstName" IS NOT NULL OR "LastName" IS NOT NULL;
(If "FirstName" and "LastName" are usually not NULL, you can omit the WHERE clause.)