PostgreSQL multi-column GROUP BY not using index when selecting minimum

Question

When selecting the MIN of a column after a GROUP BY on multiple columns, PostgreSQL (11, 12, 13) does not use any index created on the grouped columns: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=30e0f341940f4c1fa6013677643a0baf

CREATE TABLE tags (id serial, series int, index int, page int);
CREATE INDEX ON tags (page, series, index);

INSERT INTO tags (series, index, page)
SELECT
    ceil(random() * 10),
    ceil(random() * 100),
    ceil(random() * 1000)
FROM generate_series(1, 100000);

EXPLAIN ANALYZE
SELECT tags.page, tags.series, MIN(tags.index)
FROM tags GROUP BY tags.page, tags.series;

HashAggregate  (cost=2291.00..2391.00 rows=10000 width=12) (actual time=108.968..133.153 rows=9999 loops=1)
  Group Key: page, series
  Batches: 1  Memory Usage: 1425kB
  ->  Seq Scan on tags  (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.015..55.240 rows=100000 loops=1)
Planning Time: 0.257 ms
Execution Time: 133.771 ms

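In theory, the index should let the database seek in steps of (tags.page, tags.series) instead of performing a full scan. For the data set above, that would mean touching about 10,000 rows (one per group) rather than 100,000. This link describes the approach for the case without grouping columns.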

This answer (as well as this one) suggests using DISTINCT ON with an ORDER BY instead of GROUP BY, but that produces this query plan:

Unique  (cost=0.42..5680.42 rows=10000 width=12) (actual time=0.066..268.038 rows=9999 loops=1)
  ->  Index Only Scan using tags_page_series_index_idx on tags  (cost=0.42..5180.42 rows=100000 width=12) (actual time=0.064..227.219 rows=100000 loops=1)
        Heap Fetches: 100000
Planning Time: 0.426 ms
Execution Time: 268.712 ms

Although the index is now being used, it still appears to scan the entire set of rows. With SET enable_seqscan=OFF, the GROUP BY query degrades to the same behavior.
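
The DISTINCT ON query itself is not shown above. A form consistent with that plan, assuming the same tags table and index, would be roughly the following sketch. The VACUUM step is an addition, not part of the original test: "Heap Fetches: 100000" suggests the visibility map was not yet set, so vacuuming first may reduce heap fetches, although the scan would still read every index entry.

```sql
-- Optional, assumed step: update the visibility map and statistics
-- so the index-only scan can skip most heap lookups.
VACUUM ANALYZE tags;

-- DISTINCT ON keeps the first row of each (page, series) group;
-- ordering by index within the group makes that row the minimum.
EXPLAIN ANALYZE
SELECT DISTINCT ON (page, series) page, series, index
FROM tags
ORDER BY page, series, index;
```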

How can I encourage PostgreSQL to use the multi-column index?

Answer 1

If you can get the set of distinct (page, series) pairs from another table, you can hack it with a lateral join:

CREATE TABLE pageseries AS SELECT DISTINCT page,series FROM tags ORDER BY page,series;
EXPLAIN ANALYZE SELECT p.*, minindex FROM pageseries p CROSS JOIN LATERAL (SELECT index minindex FROM tags t WHERE t.page=p.page AND t.series=p.series ORDER BY page,series,index LIMIT 1) x;
 Nested Loop  (cost=0.42..8720.00 rows=10000 width=12) (actual time=0.039..56.013 rows=10000 loops=1)
   ->  Seq Scan on pageseries p  (cost=0.00..145.00 rows=10000 width=8) (actual time=0.012..1.872 rows=10000 loops=1)
   ->  Limit  (cost=0.42..0.84 rows=1 width=12) (actual time=0.005..0.005 rows=1 loops=10000)
         ->  Index Only Scan using tags_page_series_index_idx on tags t  (cost=0.42..4.62 rows=10 width=12) (actual time=0.004..0.004 rows=1 loops=10000)
               Index Cond: ((page = p.page) AND (series = p.series))
               Heap Fetches: 0
 Planning Time: 0.168 ms
 Execution Time: 57.077 ms

...but it is not necessarily faster:

EXPLAIN ANALYZE
SELECT tags.page, tags.series, MIN(tags.index)
FROM tags GROUP BY tags.page, tags.series;

 HashAggregate  (cost=2291.00..2391.00 rows=10000 width=12) (actual time=56.177..58.923 rows=10000 loops=1)
   Group Key: page, series
   Batches: 1  Memory Usage: 1425kB
   ->  Seq Scan on tags  (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.010..12.845 rows=100000 loops=1)
 Planning Time: 0.129 ms
 Execution Time: 59.644 ms

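If the number of iterations in the nested loop is small, in other words if there are few distinct (page, series) pairs, then this is much faster. I'll try it with series alone, since it has only 10 distinct values: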

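CREATE INDEX ON tags (series, index);  -- assumed: the plan below uses tags_series_index_idx, which is not created anywhere above
CREATE TABLE series AS SELECT DISTINCT series FROM tags;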
EXPLAIN ANALYZE SELECT p.*, minindex FROM series p CROSS JOIN LATERAL (SELECT index minindex FROM tags t WHERE t.series=p.series ORDER BY series,index LIMIT 1) x;
 Nested Loop  (cost=0.29..886.18 rows=2550 width=8) (actual time=0.081..0.264 rows=10 loops=1)
   ->  Seq Scan on series p  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.007..0.010 rows=10 loops=1)
   ->  Limit  (cost=0.29..0.31 rows=1 width=8) (actual time=0.024..0.024 rows=1 loops=10)
         ->  Index Only Scan using tags_series_index_idx on tags t  (cost=0.29..211.29 rows=10000 width=8) (actual time=0.023..0.023 rows=1 loops=10)
               Index Cond: (series = p.series)
               Heap Fetches: 0
 Planning Time: 0.198 ms
 Execution Time: 0.292 ms

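In this case it is definitely worth it, because the query only touches 10 of 100,000 rows. The other query touches 10,000 of 100,000 rows, or 10% of the table, which is above the threshold where an index really helps.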

Note that putting the lower-cardinality column first results in a smaller index:

CREATE INDEX ON tags (series, page, index);
select pg_relation_size( 'tags_page_series_index_idx' );
          4284416
select pg_relation_size( 'tags_series_page_index_idx' );
          3104768

...but it does not make the query any faster.

If this kind of thing really matters, maybe try ClickHouse or DolphinDB.

Answer 2

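To support that kind of thing, PostgreSQL would need something like an index skip scan, and even that would only be efficient when there are few groups.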
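
A skip scan can be emulated by hand with a recursive CTE that jumps from one (page, series) group to the next. The following is only a sketch, assuming the tags table and the (page, series, index) index from the question and no NULLs in page or series; the names t, n and min_index are illustrative. With 10,000 groups it performs 10,000 index probes, so it is not expected to beat the sequential scan here; it pays off only when groups are few.

```sql
WITH RECURSIVE t AS (
    -- Anchor: the first index entry is the minimum of the first group.
    (SELECT page, series, index
     FROM tags
     ORDER BY page, series, index
     LIMIT 1)
    UNION ALL
    -- Step: seek to the first index entry of the next (page, series) group.
    SELECT n.page, n.series, n.index
    FROM t
    CROSS JOIN LATERAL (
        SELECT page, series, index
        FROM tags
        WHERE (page, series) > (t.page, t.series)
        ORDER BY page, series, index
        LIMIT 1
    ) n
)
SELECT page, series, index AS min_index
FROM t;
```

The recursion stops on its own once the lateral subquery finds no further group.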

If query speed is important, you could consider using a materialized view.
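
As a minimal sketch of that idea, assuming the tags table from the question (the view name tags_min_index and the column name min_index are made up for illustration), the aggregate can be precomputed once and refreshed when the data changes:

```sql
-- Precompute the per-group minimum once.
CREATE MATERIALIZED VIEW tags_min_index AS
SELECT page, series, MIN(index) AS min_index
FROM tags
GROUP BY page, series;

-- A unique index speeds up lookups and is required for CONCURRENTLY below.
CREATE UNIQUE INDEX ON tags_min_index (page, series);

-- Re-run after tags changes; the view is not updated automatically.
REFRESH MATERIALIZED VIEW CONCURRENTLY tags_min_index;

-- Reads now hit the small precomputed relation.
SELECT page, series, min_index
FROM tags_min_index
WHERE page = 42;
```

The trade-off is staleness: results reflect the table as of the last refresh.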

postgresql

indexing

group-by

query-optimization

aggregate-functions