Fast PostgreSQL query with GROUP BY, COUNT(DISTINCT) and SUM on different columns
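I'm querying a table with about 1.5 million records. It has indexes, and performance is generally good.
However, for one of the columns I want a count of distinct values (the column has many duplicates). With DISTINCT, the query is about 10 times slower than without it.
This is the query: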
SELECT
    created_at,
    SUM(amount) as total,
    COUNT(DISTINCT partner_id) as count_partners
FROM
    consumption
WHERE
    is_official = true
    AND
    (is_processed = true OR is_deferred = true)
GROUP BY created_at
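This takes 2.5 seconds.
If I change it to:
COUNT(partner_id) as count_partners
it takes 230 ms. But that's not what I want.
I want the unique set of partners for each group (date), along with the sum of the amounts they spent in that period.
I don't understand why this is so slow. PostgreSQL seems to build an array containing all the duplicates very quickly, so why does simply adding DISTINCT wreck its performance?
Query plan: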
GroupAggregate  (cost=85780.70..91461.63 rows=12 width=24) (actual time=1019.428..2641.434 rows=13 loops=1)
  Output: created_at, sum(amount), count(DISTINCT partner_id)
  Group Key: p.created_at
  Buffers: shared hit=16487
  ->  Sort  (cost=85780.70..87200.90 rows=568081 width=16) (actual time=865.599..945.674 rows=568318 loops=1)
        Output: created_at, amount, partner_id
        Sort Key: p.created_at
        Sort Method: quicksort  Memory: 62799kB
        Buffers: shared hit=16487
        ->  Seq Scan on public.consumption p  (cost=0.00..31484.26 rows=568081 width=16) (actual time=0.020..272.126 rows=568318 loops=1)
              Output: created_at, amount, partner_id
              Filter: (p.is_official AND (p.is_deferred OR p.is_processed))
              Rows Removed by Filter: 931408
              Buffers: shared hit=16487
Planning Time: 0.191 ms
Execution Time: 2647.629 ms
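For anyone who wants to reproduce a plan in this shape: the Output lines come from EXPLAIN's VERBOSE option and the Buffers lines from BUFFERS. The statement below is a sketch of how the plan above was probably obtained, not the exact command used. The alias p is an assumption taken from the plan text. Note that ANALYZE actually runs the query.

```sql
-- Sketch: produce a plan with actual timings, output columns, and buffer counts.
-- ANALYZE executes the query for real; VERBOSE adds "Output:" lines;
-- BUFFERS adds "Buffers:" lines.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
    created_at,
    SUM(amount) AS total,
    COUNT(DISTINCT partner_id) AS count_partners
FROM
    consumption p   -- alias assumed from the plan ("consumption p")
WHERE
    is_official = true
    AND (is_processed = true OR is_deferred = true)
GROUP BY created_at;
```

In the plan above, the Sort node finishes at about 946 ms while the GroupAggregate finishes at about 2641 ms, so most of the time is spent inside the aggregate step itself rather than in the scan or the sort.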
Indexes:
CREATE INDEX IF NOT EXISTS i_pid ON consumption (partner_id);
CREATE INDEX IF NOT EXISTS i_processed ON consumption (is_processed);
CREATE INDEX IF NOT EXISTS i_official ON consumption (is_official);
CREATE INDEX IF NOT EXISTS i_deferred ON consumption (is_deferred);
CREATE INDEX IF NOT EXISTS i_created ON consumption (created_at);
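None of these single-column indexes is used in the plan above, which falls back to a Seq Scan; the filter keeps 568,318 of roughly 1.5 million rows, so the boolean indexes are not very selective. One option that might help is a single partial index that matches the query's WHERE clause and is ordered by the grouping columns. This is an untested sketch: the index name is hypothetical, INCLUDE requires PostgreSQL 11 or later, and whether the planner picks the index depends on the data and on server settings.

```sql
-- Hypothetical index name. The predicate repeats the query's WHERE clause
-- so that the planner can match the partial index to the query.
CREATE INDEX IF NOT EXISTS i_consumption_report
    ON consumption (created_at, partner_id)
    INCLUDE (amount)                 -- lets the index supply amount as well
    WHERE is_official = true
      AND (is_processed = true OR is_deferred = true);

-- Index-only scans rely on an up-to-date visibility map,
-- so vacuum and refresh statistics after creating the index.
VACUUM (ANALYZE) consumption;
```

If the index is used, rows arrive already ordered by created_at and partner_id, which could remove the separate Sort step; compare EXPLAIN output before and after to confirm.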
The following query should be able to benefit from the indexes.
SELECT
    created_at,
    SUM(amount) AS total,
    COUNT(DISTINCT partner_id) AS count_partners
FROM
    (SELECT
        created_at,
        sum(amount) as amount,
        partner_id
    FROM consumption
    WHERE is_official = true
        AND (is_processed = true OR is_deferred = true)
    GROUP BY
        created_at,
        partner_id
    ) AS c
GROUP BY created_at;
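Because the subquery already returns one row per (created_at, partner_id) pair, the DISTINCT in the outer COUNT no longer has any duplicates to remove. As an untested variant of the query above, a plain COUNT(partner_id) should return the same numbers (both forms ignore NULL partner_id values). It also leaves the planner free to use a hash aggregate for the outer step, which PostgreSQL cannot do for DISTINCT aggregates. No timings were measured for this version.

```sql
-- Two-stage aggregation: deduplicate partners per date in the subquery,
-- then count the remaining rows per date without DISTINCT.
SELECT
    created_at,
    SUM(amount)       AS total,
    COUNT(partner_id) AS count_partners  -- each partner appears once per date here
FROM (
    SELECT
        created_at,
        partner_id,
        SUM(amount) AS amount            -- per-partner subtotal for the date
    FROM consumption
    WHERE is_official = true
      AND (is_processed = true OR is_deferred = true)
    GROUP BY
        created_at,
        partner_id
) AS c
GROUP BY created_at;
```

Summing the per-partner subtotals in the outer query gives the same total as summing the raw amounts, so total is unchanged by the rewrite.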