Optimize query with condition on multiple tables
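I have two Postgres tables.
Table A
id | owner_id |
---|---|
1 | 100 |
2 | 101 |
Table B
id | a_id | user_id |
---|---|---|
1 | 1 | 200 |
2 | 1 | 201 |
3 | 2 | 202 |
4 | 2 | 201 |
id is the primary key on both tables and is an integer.
I have B-Tree indexes on a.owner_id, b.a_id, and b.user_id.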
First query
Consider the query below:
SELECT b.id
FROM b JOIN a ON b.a_id = a.id
WHERE b.user_id = 201
OR a.owner_id = 100
LIMIT 50;
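The condition is WHERE b.user_id = 201 OR a.owner_id = 100. The plan does not use the index on a.owner_id; as the output shows, it scans a_pkey and b_a_id and applies the OR condition as a join filter. This is the query plan: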
QUERY PLAN
Limit (cost=19.54..4445.84 rows=50 width=4) (actual time=0.125..5.031 rows=50 loops=1)
Buffers: shared hit=1054
-> Merge Join (cost=19.54..9815083.22 rows=110872 width=4) (actual time=0.123..5.018 rows=50 loops=1)
Merge Cond: (a.id = b.a_id)
Join Filter: ((b.user_id = 201) OR (a.owner_id = 100))
Rows Removed by Join Filter: 5547
Buffers: shared hit=1054
-> Index Scan using a_pkey on a (cost=0.42..103568.63 rows=100009 width=20) (actual time=0.011..0.037 rows=50 loops=1)
Buffers: shared hit=10
-> Index Scan using b_a_id on b (cost=0.43..9515274.99 rows=11200116 width=24) (actual time=0.009..3.136 rows=5597 loops=1)
Buffers: shared hit=1044
Planning Time: 0.626 ms
Execution Time: 5.082 ms
The query is a bit slow. How can I make it faster?
Second query
There is another, slower query:
SELECT b.id
FROM b JOIN a ON b.a_id = a.id
WHERE (b.user_id = 201 AND a.owner_id = 100)
OR (b.user_id = 100 AND a.owner_id = 201)
LIMIT 50;
QUERY PLAN
Limit (cost=1000.43..19742.38 rows=50 width=4) (actual time=0.705..63.142 rows=50 loops=1)
Buffers: shared hit=1419 read=3994
-> Gather (cost=1000.43..75593.36 rows=199 width=4) (actual time=0.704..63.124 rows=50 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1419 read=3994
-> Nested Loop (cost=0.43..74573.46 rows=83 width=4) (actual time=0.752..13.122 rows=17 loops=3)
Buffers: shared hit=1419 read=3994
-> Parallel Seq Scan on a (cost=0.00..25628.06 rows=83 width=20) (actual time=0.669..11.868 rows=17 loops=3)
Filter: ((owner_id = 100) OR (owner_id = 201))
Rows Removed by Filter: 16985
Buffers: shared hit=258 read=3994
-> Index Scan using b_a_id on b (cost=0.43..589.69 rows=1 width=24) (actual time=0.023..0.070 rows=1 loops=52)
Index Cond: (a_id = a.id)
Filter: (((user_id = 201) OR (user_id = 100)) AND (((user_id = 201) AND (a.owner_id = 100)) OR ((a.owner_id = 201) AND (user_id = 100))))
Rows Removed by Filter: 105
Buffers: shared hit=1161
Planning Time: 0.638 ms
Execution Time: 63.202 ms
Use UNION instead of OR:
SELECT * FROM ((SELECT b.id
FROM b JOIN a ON b.a_id = a.id
WHERE b.user_id = 201
LIMIT 50)
UNION
(SELECT b.id
FROM b JOIN a ON b.a_id = a.id
WHERE a.owner_id = 100
LIMIT 50)) AS q
LIMIT 50;
Indexes on a(owner_id), a(id), b(user_id), and b(a_id) will make it faster.
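As a quick sanity check of the rewrite itself, the sketch below loads the four sample rows from the question and compares the OR form with the UNION form. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for PostgreSQL, the LIMIT clauses are left out so the full result sets can be compared, and nothing here says anything about PostgreSQL plans or timings.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; only result equivalence
# is checked, not query plans or execution time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, owner_id INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, user_id INTEGER);
    INSERT INTO a VALUES (1, 100), (2, 101);
    INSERT INTO b VALUES (1, 1, 200), (2, 1, 201), (3, 2, 202), (4, 2, 201);
""")

# Original form: one join with an OR that spans both tables.
or_ids = {row[0] for row in conn.execute("""
    SELECT b.id FROM b JOIN a ON b.a_id = a.id
    WHERE b.user_id = 201 OR a.owner_id = 100
""")}

# Rewritten form: one branch per condition, combined with UNION.
# UNION removes duplicates, so a row matching both branches appears once.
union_ids = {row[0] for row in conn.execute("""
    SELECT b.id FROM b JOIN a ON b.a_id = a.id WHERE b.user_id = 201
    UNION
    SELECT b.id FROM b JOIN a ON b.a_id = a.id WHERE a.owner_id = 100
""")}

print(sorted(or_ids), sorted(union_ids))
```

Because b.id is the primary key, the duplicate removal done by UNION matches what the single OR query returns, which is why UNION rather than UNION ALL is the safer choice here.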
Creating test data...
CREATE UNLOGGED TABLE a AS SELECT a_id, (random()*100000)::INTEGER owner_id
FROM generate_series(1,1000000) a_id;
CREATE UNLOGGED TABLE b AS SELECT b_id, (random()*100000)::INTEGER a_id, (random()*100000)::INTEGER user_id
FROM generate_series(1,10000000) b_id;
CREATE INDEX a_o ON a(owner_id);
CREATE INDEX b_a ON b(a_id);
CREATE INDEX b_u ON b(user_id);
ALTER TABLE a ADD PRIMARY KEY(a_id);
ALTER TABLE b ADD PRIMARY KEY(b_id);
VACUUM ANALYZE a,b;
The problem with the first query is that Postgres does not know how to optimize a star join, so we have to help it a little.
WITH ids AS (
SELECT a_id FROM b WHERE user_id=201
UNION SELECT a_id FROM a WHERE owner_id=100
)
SELECT * FROM ids JOIN b USING (a_id) LIMIT 50;
This gives a plan that uses both indexes, which may or may not be faster in your case.
Limit (cost=455.41..634.97 rows=50 width=12) (actual time=0.494..0.642 rows=50 loops=1)
-> Nested Loop (cost=455.41..41596.19 rows=11456 width=12) (actual time=0.492..0.629 rows=50 loops=1)
-> HashAggregate (cost=450.19..451.32 rows=113 width=4) (actual time=0.425..0.427 rows=1 loops=1)
Group Key: b_1.a_id
Batches: 1 Memory Usage: 24kB
-> Append (cost=5.23..449.91 rows=113 width=4) (actual time=0.076..0.358 rows=98 loops=1)
-> Bitmap Heap Scan on b b_1 (cost=5.23..401.21 rows=102 width=4) (actual time=0.075..0.299 rows=92 loops=1)
Recheck Cond: (user_id = 201)
Heap Blocks: exact=92
-> Bitmap Index Scan on b_u (cost=0.00..5.20 rows=102 width=0) (actual time=0.035..0.035 rows=92 loops=1)
Index Cond: (user_id = 201)
-> Bitmap Heap Scan on a (cost=4.51..47.00 rows=11 width=4) (actual time=0.019..0.033 rows=6 loops=1)
Recheck Cond: (owner_id = 100)
Heap Blocks: exact=6
-> Bitmap Index Scan on a_o (cost=0.00..4.51 rows=11 width=0) (actual time=0.014..0.014 rows=6 loops=1)
Index Cond: (owner_id = 100)
-> Bitmap Heap Scan on b (cost=5.22..363.09 rows=101 width=12) (actual time=0.059..0.174 rows=50 loops=1)
Recheck Cond: (a_id = b_1.a_id)
Heap Blocks: exact=50
-> Bitmap Index Scan on b_a (cost=0.00..5.19 rows=101 width=0) (actual time=0.023..0.023 rows=104 loops=1)
Index Cond: (a_id = b_1.a_id)
Planning Time: 0.448 ms
Execution Time: 0.747 ms
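One thing worth checking before adopting the CTE form: it joins back to b on a_id only, so it appears to return every b row that shares an a_id with a matching row, not just the rows that satisfy the original OR. The sketch below illustrates this on the question's four sample rows, using the answer's column names (a_id, b_id). It assumes an in-memory SQLite database as a stand-in for PostgreSQL, and the extra join to a with the re-applied filter is a suggested variation, not something taken from the answer.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; only result sets are compared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, owner_id INTEGER);
    CREATE TABLE b (b_id INTEGER PRIMARY KEY, a_id INTEGER, user_id INTEGER);
    INSERT INTO a VALUES (1, 100), (2, 101);
    INSERT INTO b VALUES (1, 1, 200), (2, 1, 201), (3, 2, 202), (4, 2, 201);
""")

CTE = """
    WITH ids AS (
        SELECT a_id FROM b WHERE user_id = 201
        UNION SELECT a_id FROM a WHERE owner_id = 100
    )
"""

# The query from the question (without LIMIT).
original = {r[0] for r in conn.execute("""
    SELECT b.b_id FROM b JOIN a USING (a_id)
    WHERE b.user_id = 201 OR a.owner_id = 100
""")}

# The CTE form as written: every b row whose a_id is in ids.
cte_only = {r[0] for r in conn.execute(CTE + """
    SELECT b.b_id FROM ids JOIN b USING (a_id)
""")}

# Suggested variation: join a as well and re-apply the original filter.
cte_filtered = {r[0] for r in conn.execute(CTE + """
    SELECT b.b_id FROM ids JOIN b USING (a_id) JOIN a ON a.a_id = b.a_id
    WHERE b.user_id = 201 OR a.owner_id = 100
""")}

print(sorted(original), sorted(cte_only), sorted(cte_filtered))
```

On this data the unfiltered CTE form also returns b_id 3, which has user_id 202 and owner_id 101. Whether the filtered variant keeps the favourable plan on PostgreSQL would need to be confirmed with EXPLAIN ANALYZE on real data.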
As for the other query, I had to run this:
select owner_id, user_id, count(*) from a join b using (a_id) group by owner_id,user_id order by count(*) desc limit 100;
to get some (user_id, owner_id) pairs from my test data that actually return results. Then:
EXPLAIN ANALYZE
SELECT b.*
FROM b JOIN a USING (a_id)
WHERE (b.user_id = 99238 AND a.owner_id = 58599)
OR (b.user_id = 36859 AND a.owner_id = 99027)
LIMIT 50;
Limit (cost=24.97..532.32 rows=1 width=12) (actual time=0.274..0.982 rows=6 loops=1)
-> Nested Loop (cost=24.97..532.32 rows=1 width=12) (actual time=0.271..0.976 rows=6 loops=1)
-> Bitmap Heap Scan on a (cost=9.03..92.70 rows=22 width=8) (actual time=0.108..0.216 rows=12 loops=1)
Recheck Cond: ((owner_id = 58599) OR (owner_id = 99027))
Heap Blocks: exact=12
-> BitmapOr (cost=9.03..9.03 rows=22 width=0) (actual time=0.086..0.088 rows=0 loops=1)
-> Bitmap Index Scan on a_o (cost=0.00..4.51 rows=11 width=0) (actual time=0.064..0.065 rows=3 loops=1)
Index Cond: (owner_id = 58599)
-> Bitmap Index Scan on a_o (cost=0.00..4.51 rows=11 width=0) (actual time=0.020..0.020 rows=9 loops=1)
Index Cond: (owner_id = 99027)
-> Bitmap Heap Scan on b (cost=15.95..19.97 rows=1 width=12) (actual time=0.058..0.060 rows=0 loops=12)
Recheck Cond: ((a_id = a.a_id) AND ((user_id = 99238) OR (user_id = 36859)))
Filter: (((user_id = 99238) AND (a.owner_id = 58599)) OR ((user_id = 36859) AND (a.owner_id = 99027)))
Heap Blocks: exact=6
-> BitmapAnd (cost=15.95..15.95 rows=1 width=0) (actual time=0.053..0.053 rows=0 loops=12)
-> Bitmap Index Scan on b_a (cost=0.00..5.19 rows=101 width=0) (actual time=0.015..0.015 rows=50 loops=12)
Index Cond: (a_id = a.a_id)
-> BitmapOr (cost=10.50..10.50 rows=205 width=0) (actual time=0.046..0.046 rows=0 loops=6)
-> Bitmap Index Scan on b_u (cost=0.00..5.20 rows=102 width=0) (actual time=0.021..0.021 rows=121 loops=6)
Index Cond: (user_id = 99238)
-> Bitmap Index Scan on b_u (cost=0.00..5.20 rows=102 width=0) (actual time=0.024..0.024 rows=105 loops=6)
Index Cond: (user_id = 36859)
Planning Time: 0.703 ms
Execution Time: 1.063 ms
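It does not use a sequential scan like yours does, so maybe your older version fails to optimize it properly? It is odd that it chooses a seq scan on table a when the row count estimates are quite accurate. You should investigate this; maybe try: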
SELECT * FROM a WHERE a.owner_id = 58599 OR a.owner_id = 99027
LIMIT 50;
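This should give an index scan or a bitmap index scan. If it does a seq scan instead, you have a small test case for finding out why. In any case, you can still force index use: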
EXPLAIN ANALYZE
WITH ids AS (
SELECT a_id FROM b WHERE user_id IN (99238,36859)
UNION SELECT a_id FROM a WHERE owner_id IN (58599,99027)
)
SELECT * FROM ids JOIN b USING (a_id) JOIN a USING (a_id)
WHERE (b.user_id = 99238 AND a.owner_id = 58599)
OR (b.user_id = 36859 AND a.owner_id = 99027);
...but it is very ugly. Or you can handle each clause of your OR separately and do the AND inside it with INTERSECT, like this, which is also ugly:
EXPLAIN ANALYZE
SELECT a_id FROM b WHERE b.user_id = 99238
INTERSECT
SELECT a_id FROM a WHERE a.owner_id = 58599
LIMIT 50;
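To make the "each OR clause separately" idea concrete, here is a minimal sketch on the question's sample rows, using the (user_id, owner_id) pairs from the second query. It assumes an in-memory SQLite database as a stand-in for PostgreSQL, and the helper name a_ids_for is hypothetical. Each pair is resolved with one INTERSECT, and the per-pair results are combined in Python, which sidesteps differences in how engines order INTERSECT and UNION within one statement. Note that this yields a_id values rather than b rows.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL; only result sets are compared.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (a_id INTEGER PRIMARY KEY, owner_id INTEGER);
    CREATE TABLE b (b_id INTEGER PRIMARY KEY, a_id INTEGER, user_id INTEGER);
    INSERT INTO a VALUES (1, 100), (2, 101);
    INSERT INTO b VALUES (1, 1, 200), (2, 1, 201), (3, 2, 202), (4, 2, 201);
""")

def a_ids_for(user_id, owner_id):
    """Hypothetical helper: a_id values matching one (user_id AND owner_id) clause."""
    rows = conn.execute("""
        SELECT a_id FROM b WHERE user_id = ?
        INTERSECT
        SELECT a_id FROM a WHERE owner_id = ?
    """, (user_id, owner_id))
    return {r[0] for r in rows}

# One INTERSECT per OR clause; the OR itself is the union of the results.
pairs = [(201, 100), (100, 201)]
combined = set().union(*(a_ids_for(u, o) for u, o in pairs))

# Reference result: the single-statement OR form, reduced to distinct a_id.
direct = {r[0] for r in conn.execute("""
    SELECT DISTINCT b.a_id FROM b JOIN a USING (a_id)
    WHERE (b.user_id = 201 AND a.owner_id = 100)
       OR (b.user_id = 100 AND a.owner_id = 201)
""")}

print(sorted(combined), sorted(direct))
```

To get back the matching b rows, the a_id values would still have to be joined to b with the user_id filter re-applied for each pair.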
How can I optimize large OFFSETs
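In practice, using a large OFFSET is usually a sign that you are doing something wrong: repeatedly executing the same query (for pagination, say) and displaying the results in chunks. There are two ways around it. If the results are fetched quickly enough that the transaction can stay open while you do so, open a cursor for the query without LIMIT or OFFSET and use FETCH to retrieve the results in chunks. Otherwise, run the query once without LIMIT, store the results in a cache, and paginate from the cache without re-running the query.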
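The cursor approach can be sketched with Python's DB-API, where fetchmany plays the role of FETCH. This is an illustration under assumptions: SQLite stands in for PostgreSQL, and the items table and the pages helper are hypothetical. The query runs once with no LIMIT or OFFSET, and each page is pulled from the same open cursor.

```python
import sqlite3

# Assumption: a tiny hypothetical table in SQLite, standing in for the real query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 11)])

def pages(connection, page_size):
    """Yield successive pages from one cursor instead of re-running with OFFSET."""
    cur = connection.execute("SELECT id FROM items ORDER BY id")
    while True:
        chunk = cur.fetchmany(page_size)  # analogous to FETCH page_size
        if not chunk:
            break
        yield [row[0] for row in chunk]

all_pages = list(pages(conn, 4))
print(all_pages)
```

The cursor only stays valid while the connection (and, on PostgreSQL, the surrounding transaction) remains open, which is the condition the paragraph above attaches to this approach.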