PostgreSQL index not used

Question

I have a table called item with several million rows. Its columns look something like this:

CREATE TABLE item (
  id bigint NOT NULL,
  company_id bigint NOT NULL,
  date_created timestamp with time zone,
  ....
)

There is an index on company_id:

CREATE INDEX idx_company_id ON item USING btree (company_id);

This table is frequently searched for the last 10 items of a given customer, i.e.

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

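Currently, one customer accounts for about 75% of the data in the table, and the other 25% is spread across 25 or so other customers. That means 75% of the rows have a company id of 5, and the remaining rows have company ids between 6 and 25.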

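The query generally runs very fast for every company except the predominant one (id = 5). I can understand why: the index on company_id is selective enough to be useful for every company except 5.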

I have tried different indexes to make the search for company 5 more efficient. The one that seemed to make the most sense is

CREATE INDEX idx_date_created
ON item (date_created DESC NULLS LAST);

If I add this index, queries for the predominant company (id = 5) improve greatly, but queries for all other companies become very slow.

Here are some EXPLAIN ANALYZE results for company ids 5 and 6, with and without the new index:

Company id 5

Before the new index:

QUERY PLAN
Limit  (cost=214874.63..214874.65 rows=10 width=639) (actual time=10481.989..10482.017 rows=10 loops=1)
  ->  Sort  (cost=214874.63..218560.33 rows=1474282 width=639) (actual time=10481.985..10481.994 rows=10 loops=1)
        Sort Key: photo_created
        Sort Method: top-N heapsort  Memory: 35kB
        ->  Seq Scan on photo  (cost=0.00..183015.92 rows=1474282 width=639) (actual time=0.009..5345.551 rows=1473561 loops=1)
              Filter: (company_id = 5)
              Rows Removed by Filter: 402513
Total runtime: 10482.075 ms

After the new index:

QUERY PLAN
Limit  (cost=0.43..1.98 rows=10 width=639) (actual time=0.087..0.120 rows=10 loops=1)
  ->  Index Scan using idx_photo__photo_created on photo  (cost=0.43..228408.04 rows=1474282 width=639) (actual time=0.084..0.099 rows=10 loops=1)
        Filter: (company_id = 5)
        Rows Removed by Filter: 26
Total runtime: 0.164 ms

Company id 6

Before the new index:

QUERY PLAN
Limit  (cost=2204.27..2204.30 rows=10 width=639) (actual time=0.044..0.053 rows=3 loops=1)
  ->  Sort  (cost=2204.27..2207.55 rows=1310 width=639) (actual time=0.040..0.044 rows=3 loops=1)
        Sort Key: photo_created
        Sort Method: quicksort  Memory: 28kB
        ->  Index Scan using idx_photo__company_id on photo  (cost=0.43..2175.96 rows=1310 width=639) (actual time=0.020..0.026 rows=3 loops=1)
              Index Cond: (company_id = 6)
Total runtime: 0.100 ms

After the new index:

QUERY PLAN
Limit  (cost=0.43..1744.00 rows=10 width=639) (actual time=0.039..3938.986 rows=3 loops=1)
  ->  Index Scan using idx_photo__photo_created on photo  (cost=0.43..228408.04 rows=1310 width=639) (actual time=0.035..3938.975 rows=3 loops=1)
        Filter: (company_id = 6)
        Rows Removed by Filter: 1876071
Total runtime: 3939.028 ms

I have run a full VACUUM and ANALYZE on the table, so PostgreSQL should have up-to-date statistics. Any ideas how I can get PostgreSQL to choose the right index for the company being queried?

Answer 1

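This is known as the "abort-early plan problem", and it has been a chronic mis-optimization for years. Abort-early plans are amazing when they work, but terrible when they don't; see the linked mailing list thread for a more detailed explanation. Basically, the planner thinks it will find the 10 rows you want for customer 6 without scanning the whole date_created index, and it is wrong. Your own plan shows the cost of that mistake: the index scan discards 1,876,071 rows to return 3.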

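Before PostgreSQL 10 (which was still in beta at the time of writing), there is no hard-and-fast way to categorically improve this query. What you will want to do is nudge the query planner in various ways in the hope of getting the plan you want. The primary methods include anything that makes PostgreSQL more likely to use a multi-column index, such as: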

Lowering random_page_cost (a good idea anyway if you are on SSDs).
Lowering cpu_index_tuple_cost.
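
As a rough sketch of how one might try the two cost settings above, the following changes them for the current session only and re-runs the query from the question. The specific values are illustrative assumptions, not recommendations; the right numbers depend on your hardware and need to be checked with EXPLAIN ANALYZE.

```sql
-- Session-level experiment only; the values below are assumptions for illustration.
SET random_page_cost = 1.1;        -- PostgreSQL default is 4.0
SET cpu_index_tuple_cost = 0.001;  -- PostgreSQL default is 0.005

-- Re-check the plan for a minority company and for the predominant one.
EXPLAIN ANALYZE
SELECT * FROM item WHERE company_id = 6 ORDER BY date_created LIMIT 10;

EXPLAIN ANALYZE
SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

-- Restore the configured values for this session.
RESET random_page_cost;
RESET cpu_index_tuple_cost;
```

Using SET rather than editing postgresql.conf keeps the experiment local to one connection, so other queries are unaffected while you compare plans.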

You may also be able to fix the planner's behavior by adjusting the table statistics. This includes:

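Raising statistics_target for the table and running ANALYZE again, so that PostgreSQL takes more samples and gets a better idea of the row distribution;
Increasing n_distinct in the statistics so that it accurately reflects the number of distinct customer_ids or distinct created_dates.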
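
A minimal sketch of the two statistics adjustments above, using the table and column names from the question. The target of 1000 and the n_distinct value of 26 are assumptions (26 is only a guess at "one main customer plus 25 or so others"); substitute the real figures for your data.

```sql
-- Sample more rows for these columns (the per-column target overrides
-- default_statistics_target). 1000 is an illustrative value.
ALTER TABLE item ALTER COLUMN company_id SET STATISTICS 1000;
ALTER TABLE item ALTER COLUMN date_created SET STATISTICS 1000;

-- Override the estimated number of distinct company ids.
-- 26 is an assumed count; a positive value is taken as an absolute number.
ALTER TABLE item ALTER COLUMN company_id SET (n_distinct = 26);

-- Both changes only take effect at the next ANALYZE.
ANALYZE item;

-- Inspect what the planner now believes about company_id.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'item' AND attname = 'company_id';
```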

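However, all of these solutions are approximate, and if query performance goes bad as your data changes in the future, this should be the first query you look at.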

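In PostgreSQL 10, you will be able to create Cross-Column Stats, which should improve the situation more reliably. Depending on how badly this is affecting you, you could try using the beta.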
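
For reference, cross-column statistics in PostgreSQL 10 are created with CREATE STATISTICS. The sketch below assumes the column names from the question; the statistics object name is hypothetical, and whether this actually changes the plan for this particular query is something you would need to verify with EXPLAIN ANALYZE.

```sql
-- Requires PostgreSQL 10 or later.
-- item_company_date_stats is a hypothetical name chosen for this example.
CREATE STATISTICS item_company_date_stats (dependencies, ndistinct)
  ON company_id, date_created
  FROM item;

-- Extended statistics are only populated when the table is analyzed.
ANALYZE item;

-- Compare the plan before and after creating the statistics object.
EXPLAIN ANALYZE
SELECT * FROM item WHERE company_id = 6 ORDER BY date_created LIMIT 10;
```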

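If none of this works, I suggest asking on the #postgresql IRC channel on Freenode or on the pgsql-performance mailing list. The people there will ask for your detailed table statistics in order to make suggestions.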

Answer 2

One more point: why do you create the index

CREATE INDEX idx_date_created ON item (date_created DESC NULLS LAST);

but then run:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created LIMIT 10;

Perhaps you meant:

SELECT * FROM item WHERE company_id = 5 ORDER BY date_created DESC NULLS LAST LIMIT 10;

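It is also better to create a composite index, so that the scan is restricted to one company and already sorted by date: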

CREATE INDEX idx_company_id_date_created ON item (company_id, date_created DESC NULLS LAST);

After that, the plans for company ids 5 and 6 are:

                                                                     QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..28.11 rows=10 width=16) (actual time=0.120..0.153 rows=10 loops=1)
   ->  Index Only Scan using idx_company_id_date_created on item  (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.118..0.145 rows=10 loops=1)
         Index Cond: (company_id = 5)
         Heap Fetches: 10
 Planning time: 1.003 ms
 Execution time: 0.209 ms
(6 rows)
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..28.11 rows=10 width=16) (actual time=0.085..0.115 rows=10 loops=1)
   ->  Index Only Scan using idx_company_id_date_created on item  (cost=0.43..20763.68 rows=7500 width=16) (actual time=0.084..0.108 rows=10 loops=1)
         Index Cond: (company_id = 6)
         Heap Fetches: 10
 Planning time: 0.136 ms
 Execution time: 0.155 ms
(6 rows)

On your server it may be slower, but in any case it will be much better than in the examples above.
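
If you want to try the composite index before touching production, a scratch table with a similar skew can be built as sketched below. Everything here (the item_demo table, the index name, the one million rows, the exact distribution) is a made-up assumption for experimentation, not the data behind the plans above.

```sql
-- Hypothetical scratch table mirroring the columns from the question.
CREATE TABLE item_demo (
  id bigint NOT NULL,
  company_id bigint NOT NULL,
  date_created timestamp with time zone
);

-- Roughly 75% of rows get company_id 5; the rest are spread over 6..25.
INSERT INTO item_demo (id, company_id, date_created)
SELECT g,
       CASE WHEN g % 4 <> 0 THEN 5 ELSE 6 + (g / 4) % 20 END,
       now() - g * interval '1 second'
FROM generate_series(1, 1000000) AS g;

-- Composite index: equality column first, then the sort column.
CREATE INDEX idx_demo_company_date
  ON item_demo (company_id, date_created DESC NULLS LAST);

ANALYZE item_demo;

-- The ORDER BY matches the index ordering, so the LIMIT can stop early
-- within the entries for a single company.
EXPLAIN ANALYZE
SELECT * FROM item_demo
WHERE company_id = 6
ORDER BY date_created DESC NULLS LAST
LIMIT 10;

EXPLAIN ANALYZE
SELECT * FROM item_demo
WHERE company_id = 5
ORDER BY date_created DESC NULLS LAST
LIMIT 10;
```

Putting company_id first lets the index narrow the scan to one company, and the second key delivers those rows already in the requested order, which is why neither a sort nor a filter over other companies' rows should be needed.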

postgresql

database-indexes