Speed up slow Postgres query with window functions
I'm trying to optimise a query because the query my ORM (Django) generates is causing timeouts. I've done all I can within the ORM to run it as a single query, so now I'd like to know whether there are any Postgres tricks that can speed things up.
The database holds 1M+ (and growing) relationships (id, source and target), and I need to filter them to exclude connections whose source doesn't appear at least twice.
This is the current query - the list of "target" IDs can grow, which causes the speed to drop off exponentially.
SELECT * FROM
(SELECT
"source",
"target",
count("id") OVER (PARTITION BY "source") AS "count_match"
FROM
"database_name"
WHERE
("database_name"."target" IN (123, 456, 789))
) AS temp_data WHERE "temp_data"."count_match" >= 2
I've read about VIEWS and temporary TABLES, but that seems like a lot of setup and teardown for a one-off query.
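For reference, a minimal sketch of what the temporary-table variant mentioned above could look like (the table name filtered_rels is invented for illustration, and whether this actually beats the single window-function query is untested):

-- Materialise only the rows matching the target list (filtered_rels is a
-- hypothetical name).
CREATE TEMPORARY TABLE filtered_rels AS
SELECT "source", "target"
FROM "database_name"
WHERE "target" IN (123, 456, 789);

-- Keep only rows whose source appears at least twice in the filtered set.
SELECT f.*
FROM filtered_rels AS f
JOIN (
    SELECT "source"
    FROM filtered_rels
    GROUP BY "source"
    HAVING count(*) >= 2
) AS keep USING ("source");

DROP TABLE filtered_rels;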
Edit: more information and testing with a higher-memory instance
Result of EXPLAIN ANALYSE:
Subquery Scan on alias_test  (cost=622312.29..728296.62 rows=1177604 width=24) (actual time=10245.731..18019.237 rows=1604749 loops=1)
  Filter: (alias_test.count_match >= 2)
  Rows Removed by Filter: 2002738
  ->  WindowAgg  (cost=622312.29..684136.48 rows=3532811 width=20) (actual time=10245.687..16887.428 rows=3607487 loops=1)
        ->  Sort  (cost=622312.29..631144.32 rows=3532811 width=20) (actual time=10245.630..12455.796 rows=3607487 loops=1)
              Sort Key: database_name.source
              Sort Method: external merge  Disk: 105792kB
              ->  Bitmap Heap Scan on database_name  (cost=60934.74..238076.96 rows=3532811 width=20) (actual time=352.529..1900.162 rows=3607487 loops=1)
                    Recheck Cond: (target = ANY ('{5495502,80455548,10129504,2052517,11564026,1509187,1981101,1410001}'::bigint[]))
                    Heap Blocks: exact=33716
                    ->  Bitmap Index Scan on database_name_target_426d2f46_uniq  (cost=0.00..60051.54 rows=3532811 width=0) (actual time=336.457..336.457 rows=3607487 loops=1)
                          Index Cond: (target = ANY ('{5495502,80455548,10129504,2052517,11564026,1509187,1981101,1410001}'::bigint[]))
Planning time: 0.288 ms
Execution time: 18318.194 ms
Table structure:
Column | Type | Modifiers
---------------+--------------------------+-----------------------------------------------------------------------------------
created_date | timestamp with time zone | not null
modified_date | timestamp with time zone | not null
id | integer | not null default nextval('database_name_id_seq'::regclass)
source | bigint | not null
target | bigint | not null
active | boolean | not null
Indexes:
"database_name_pkey" PRIMARY KEY, btree (id)
"database_name_source_24c75675_uniq" btree (source)
"database_name_target_426d2f46_uniq" btree (target)
Hardware:
I've tried bumping the server up to an 8GB-memory instance and updated the .conf file with the following values from PGTune:
max_connections = 10
shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 209715kB
maintenance_work_mem = 512MB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
Despite the higher work_mem setting, it is still using a disk merge for the sort, which confuses me. Maybe the window function is causing this behaviour?
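One way to check whether the spill really is a work_mem problem is to raise work_mem for a single session and re-run the plan; an in-memory sort typically needs noticeably more memory than the on-disk size the plan reports, so roughly 205MB of work_mem may still be too little for a sort that wrote about 105MB to disk. A sketch (the 512MB value is arbitrary, not a tuning recommendation):

-- Raise work_mem for this session only (512MB is an arbitrary test value).
SET work_mem = '512MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM
 (SELECT
  "source",
  "target",
  count("id") OVER (PARTITION BY "source") AS "count_match"
 FROM
  "database_name"
 WHERE
  ("database_name"."target" IN (123, 456, 789))
 ) AS temp_data WHERE "temp_data"."count_match" >= 2;

-- If the Sort node now reports "quicksort  Memory: ..." instead of
-- "external merge  Disk: ...", the sort fits in memory.
RESET work_mem;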
Your query is already optimized. There is no way to avoid scanning the whole table to get the information you need, and a sequential scan is the best way to do that.
Make sure that work_mem is large enough so that the aggregation can be done in memory - you can set log_temp_files to monitor whether temporary files are used (which makes everything much slower).
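A sketch of how log_temp_files could be switched on (0 logs every temporary file; a positive value logs only files larger than that many kilobytes):

-- Log every temporary file that is created (takes effect after a config reload).
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();

-- Or, for a single session while testing (superuser only):
SET log_temp_files = 0;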