Indexing strategy for different combinations of WHERE clauses incl. text patterns
Continuing from another question here:
When I run the following query, it hits the composite index I created on the fields datelocal, views, impressions, gender, agegroup:
SELECT date_part('hour', datelocal) AS hour
, SUM(views) FILTER (WHERE gender = 'male') AS male
, SUM(views) FILTER (WHERE gender = 'female') AS female
FROM reportimpression
WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01'
GROUP BY 1
ORDER BY 1;
However, I would also like to be able to filter this query with additional clauses in the WHERE, for example:
SELECT date_part('hour', datelocal) AS hour
, SUM(views) FILTER (WHERE gender = 'male') AS male
, SUM(views) FILTER (WHERE gender = 'female') AS female
FROM reportimpression
WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01'
AND network LIKE '%'
GROUP BY 1
ORDER BY 1;
The second query is MUCH slower than the first, although it should be operating on far fewer records, and it does not hit my index.
Table schema:
CREATE TABLE reportimpression (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);
EXPLAIN ANALYZE output:
Finalize GroupAggregate (cost=1005368.37..1005385.70 rows=3151 width=24) (actual time=70615.636..70615.649 rows=24 loops=1)
Group Key: (date_part('hour'::text, datelocal))
-> Sort (cost=1005368.37..1005369.94 rows=3151 width=24) (actual time=70615.631..70615.634 rows=48 loops=1)
Sort Key: (date_part('hour'::text, datelocal))
Sort Method: quicksort Memory: 28kB
-> Gather (cost=1005005.62..1005331.75 rows=3151 width=24) (actual time=70615.456..70641.208 rows=48 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Partial HashAggregate (cost=1004005.62..1004016.65 rows=3151 width=24) (actual time=70613.132..70613.152 rows=24 loops=2)
Group Key: date_part('hour'::text, datelocal)
-> Parallel Seq Scan on reportimpression (cost=0.00..996952.63 rows=2821195 width=17) (actual time=0.803..69876.914 rows=2429159 loops=2)
Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-03-01 00:00:00'::timestamp without time zone) AND (network ~~ '%'::text))
Rows Removed by Filter: 6701736
Planning time: 0.195 ms
Execution time: 70641.349 ms
Do I need to create additional indexes, tweak my SELECT, or do something else entirely?
The predicate you added uses the LIKE operator:
AND network LIKE '%'
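The actual query plan depends on what you pass instead of '%'.
But, generally, plain btree indexes are useless for this. You will need a trigram index, or the text search infrastructure, or similar, depending on what patterns you might be looking for.
See: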
- PostgreSQL LIKE query performance variations
- Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
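To illustrate the trigram option, here is a minimal sketch. It assumes the pg_trgm extension can be installed on your server; the index name and the search pattern 'foo' are placeholders, not taken from your setup. Note that LIKE '%' on its own matches every non-null value, so no index can make that particular pattern selective.

```sql
-- Assumption: the additional module pg_trgm is available on this server.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name. A GIN index with the trigram operator class
-- can support LIKE patterns that are not anchored to the start of the string.
CREATE INDEX reportimpression_network_trgm_idx
    ON reportimpression USING gin (network gin_trgm_ops);

-- Placeholder pattern: 'foo' stands in for whatever you actually search for.
SELECT date_part('hour', datelocal) AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01' AND datelocal < '2019-03-01'
AND    network LIKE '%foo%'
GROUP  BY 1
ORDER  BY 1;
```

Whether the planner actually uses such an index depends on how selective the pattern is compared to the date range, so check with EXPLAIN (ANALYZE) on your data.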
You can even combine multiple indexing strategies.
If the predicate should actually be:
AND network = '<input_string>'
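then, by all means, actually use the = operator, not LIKE. Reasons, in ascending order of importance:
- Shorter
- Less confusing
- Makes the job of the Postgres planner simpler (very slightly cheaper)
- Correct
If you inadvertently pass a string containing special characters, you might get incorrect results. See: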
- Escape function for regular expression or LIKE patterns
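A short sketch of the correctness point, and of one way to index the equality case. The index name is hypothetical, and putting network before datelocal is my assumption about your typical filters (equality column first, range column second), not something derived from your data.

```sql
-- In a LIKE pattern '_' matches any single character and '%' any sequence,
-- so an unescaped input string can match more than intended.
SELECT 'a_b' LIKE 'a_b' AS literal_match     -- true
     , 'axb' LIKE 'a_b' AS unintended_match  -- true: '_' acted as a wildcard
     , 'axb' = 'a_b'    AS equality_match;   -- false

-- Hypothetical index name. A plain btree index serves the combination of
-- an equality predicate on network and a range predicate on datelocal.
CREATE INDEX reportimpression_network_datelocal_idx
    ON reportimpression (network, datelocal);

-- '<input_string>' is the placeholder from above, to be replaced by your value.
SELECT date_part('hour', datelocal) AS hour
     , SUM(views) FILTER (WHERE gender = 'male')   AS male
     , SUM(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01' AND datelocal < '2019-03-01'
AND    network = '<input_string>'
GROUP  BY 1
ORDER  BY 1;
```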