How to get a date_part query to hit an index?
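I haven't been able to get this query to hit an index instead of performing a full scan. I have another query that uses date_part('day', datelocal) against an almost identical table (that table just has a bit less data, but the same structure), and that one does hit the index I created on the datelocal column (a timestamp without time zone). The query (this one performs a parallel seq scan on the table and does an in-memory quicksort):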
SELECT
date_part('hour', datelocal) AS hour,
SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpression
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_part('hour', datelocal)
ORDER BY date_part('hour', datelocal)
Here is the other one, which does hit my datelocal index:
SELECT
date_part('day', datelocal) AS day,
SUM(CASE WHEN gender LIKE 'male' THEN views ELSE 0 END) AS male,
SUM(CASE WHEN gender LIKE 'female' THEN views ELSE 0 END) AS female
FROM reportimpressionday
WHERE datelocal >= '2-1-2019' AND datelocal < '2-28-2019'
GROUP BY date_trunc('day', datelocal), date_part('day', datelocal)
ORDER BY date_trunc('day', datelocal)
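This is driving me up the wall! Any ideas on how to speed up the first one, or at least get it to hit an index? I have tried creating an index on the datelocal field, a composite index on datelocal, gender, and views, and an expression index on date_part('hour', datelocal), but none of them worked.
Schemas: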
-- Table Definition ----------------------------------------------
CREATE TABLE reportimpression (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpression_datelocal_index ON reportimpression(datelocal timestamp_ops);
CREATE INDEX reportimpression_viewership_index ON reportimpression(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
CREATE INDEX reportimpression_test_index ON reportimpression(datelocal timestamp_ops,(date_part('hour'::text, datelocal)) float8_ops);
-- Table Definition ----------------------------------------------
CREATE TABLE reportimpressionday (
datelocal timestamp without time zone,
devicename text,
network text,
sitecode text,
advertisername text,
mediafilename text,
gender text,
agegroup text,
views integer,
impressions integer,
dwelltime numeric
);
-- Indices -------------------------------------------------------
CREATE INDEX reportimpressionday_datelocal_index ON reportimpressionday(datelocal timestamp_ops);
CREATE INDEX reportimpressionday_detail_index ON reportimpressionday(datelocal timestamp_ops,views int4_ops,impressions int4_ops,gender text_ops,agegroup text_ops);
EXPLAIN (ANALYZE, BUFFERS) output:
Finalize GroupAggregate (cost=999842.42..999859.67 rows=3137 width=24) (actual time=43754.700..43754.714 rows=24 loops=1)
Group Key: (date_part('hour'::text, datelocal))
Buffers: shared hit=123912 read=823290
I/O Timings: read=81228.280
-> Sort (cost=999842.42..999843.99 rows=3137 width=24) (actual time=43754.695..43754.698 rows=48 loops=1)
Sort Key: (date_part('hour'::text, datelocal))
Sort Method: quicksort Memory: 28kB
Buffers: shared hit=123912 read=823290
I/O Timings: read=81228.280
-> Gather (cost=999481.30..999805.98 rows=3137 width=24) (actual time=43754.520..43777.558 rows=48 loops=1)
Workers Planned: 1
Workers Launched: 1
Buffers: shared hit=123912 read=823290
I/O Timings: read=81228.280
-> Partial HashAggregate (cost=998481.30..998492.28 rows=3137 width=24) (actual time=43751.649..43751.672 rows=24 loops=2)
Group Key: date_part('hour'::text, datelocal)
Buffers: shared hit=123912 read=823290
I/O Timings: read=81228.280
-> Parallel Seq Scan on reportimpression (cost=0.00..991555.98 rows=2770129 width=17) (actual time=13.097..42974.126 rows=2338145 loops=2)
Filter: ((datelocal >= '2019-02-01 00:00:00'::timestamp without time zone) AND (datelocal < '2019-02-28 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 6792750
Buffers: shared hit=123912 read=823290
I/O Timings: read=81228.280
Planning time: 0.185 ms
Execution time: 43777.701 ms
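Well, your two queries are on different tables (reportimpression vs. reportimpressionday), so comparing the two queries is not really a comparison. Did you ANALYZE both? Various column statistics may also play a role. Index or table bloat may differ. Does a major part of all rows qualify for February 2019? And so on.
A shot in the dark: compare the percentages for both tables: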
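To check the statistics question, something like the following could be run. This is a minimal sketch: it refreshes planner statistics for both tables and then reads the standard pg_stat_user_tables view to see when each table was last analyzed and how many dead tuples it carries. Only the two table names come from the question.

```sql
-- Refresh planner statistics for both tables.
ANALYZE reportimpression;
ANALYZE reportimpressionday;

-- When were the tables last analyzed, and how many dead tuples
-- (a rough hint at bloat) does each one carry?
SELECT relname
     , last_analyze
     , last_autoanalyze
     , n_live_tup
     , n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname IN ('reportimpression', 'reportimpressionday');
```

A high n_dead_tup relative to n_live_tup suggests that autovacuum is not keeping up, which also matters for index-only scans.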
SELECT tbl, round(share * 100 / total, 2) As percentage
FROM (
SELECT text 'reportimpression' AS tbl
, count(*)::numeric AS total
, count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')::numeric AS share
FROM reportimpression
UNION ALL
SELECT 'reportimpressionday'
, count(*)
, count(*) FILTER (WHERE datelocal >= '2019-02-01' AND datelocal < '2019-03-01')
FROM reportimpressionday
) sub;
Is the one for reportimpression bigger? Then it might just exceed the share of rows for which an index is expected to help.
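Generally, your index reportimpression_datelocal_index on (datelocal) looks good for this, and reportimpression_viewership_index even allows index-only scans if autovacuum keeps up with the write load on the table. (Though impressions and agegroup are just dead freight for this query, and it would work even better without them.)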
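An index without the dead freight might look like the sketch below. The index name is hypothetical, and whether the planner actually picks it still depends on how many rows qualify and on how much of the table is marked all-visible.

```sql
-- Hypothetical leaner index: only the columns the hourly query touches.
CREATE INDEX reportimpression_hourly_idx
    ON reportimpression (datelocal, gender, views);

-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM (ANALYZE) reportimpression;
```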
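Answer
For my query you got 26.6 percent, and day is 26.4 percent. For such a large percentage, indexes are typically not useful at all. A sequential scan is typically the fastest way. Only index-only scans may still make sense if the underlying table is much bigger. (Or you have severe table bloat and less bloated indexes, which makes indexes more attractive again.)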
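Your first query may be just past the tipping point. Try narrowing the time frame until you see index-only scans. You won't see (bitmap) index scans when more than roughly 5 % of all rows qualify (depending on many factors).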
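One way to look for the tipping point is to rerun EXPLAIN with a progressively smaller window. The two-day range below is an arbitrary assumption chosen for illustration; widen or shrink it and watch where the plan switches away from the sequential scan.

```sql
-- Same aggregation as the hourly query, but over a much narrower range.
-- Compare the scan node in this plan with the Parallel Seq Scan seen
-- for the full month.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_part('hour', datelocal) AS hour
     , sum(views) FILTER (WHERE gender = 'male')   AS male
     , sum(views) FILTER (WHERE gender = 'female') AS female
FROM   reportimpression
WHERE  datelocal >= '2019-02-01'
AND    datelocal <  '2019-02-03'   -- assumed two-day window
GROUP  BY 1
ORDER  BY 1;
```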
Queries
Be that as it may, consider these modified queries:
SELECT date_part('hour', datelocal) AS hour
, SUM(views) FILTER (WHERE gender = 'male') AS male
, SUM(views) FILTER (WHERE gender = 'female') AS female
FROM reportimpression
WHERE datelocal >= '2019-02-01'
AND datelocal < '2019-03-01' -- '2019-02-28' -- ?
GROUP BY 1
ORDER BY 1;
SELECT date_trunc('day', datelocal) AS day
, SUM(views) FILTER (WHERE gender = 'male') AS male
, SUM(views) FILTER (WHERE gender = 'female') AS female
FROM reportimpressionday
WHERE datelocal >= '2019-02-01'
AND datelocal < '2019-03-01'
GROUP BY 1
ORDER BY 1;
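One behavioral difference is worth keeping in mind when switching from SUM(CASE ... ELSE 0 END) to the FILTER form: when no row matches, the FILTER version yields NULL rather than 0. The self-contained sketch below uses a made-up three-row VALUES list in place of the real table.

```sql
-- Made-up sample rows; column names mirror the question's table.
SELECT sum(CASE WHEN gender = 'male' THEN views ELSE 0 END)     AS male_case
     , sum(views) FILTER (WHERE gender = 'male')                AS male_filter
     , sum(views) FILTER (WHERE gender = 'other')               AS other_filter       -- NULL: no matching rows
     , coalesce(sum(views) FILTER (WHERE gender = 'other'), 0)  AS other_filter_zero  -- 0
FROM  (VALUES ('male', 3), ('female', 5), ('male', 4)) AS t(gender, views);
```

Wrap the aggregate in coalesce() if downstream code expects 0 for empty groups.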
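Major points
When using a localized date format like '2-1-2019', go through to_timestamp() with explicit format specifiers. Otherwise the interpretation depends on session settings (DateStyle in particular) and might break, silently, when called from a session with different settings. Better still, use ISO date / time formats, which do not depend on those settings.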
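As a small illustration, the following sketch parses the question's literal with an explicit pattern and compares it with the ISO form. The cast to timestamp is an assumption made here because to_timestamp() returns timestamp with time zone while datelocal is a timestamp without time zone.

```sql
-- Explicit pattern: month-day-year, independent of the DateStyle setting.
SELECT to_timestamp('2-1-2019', 'MM-DD-YYYY')::timestamp AS parsed_with_pattern
     , to_date('2-1-2019', 'MM-DD-YYYY')                 AS parsed_as_date
     , timestamp '2019-02-01'                            AS iso_literal;
```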
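It looks like you want to include the whole month of February, but your query misses the upper bound. For one, February may have 29 days. And datelocal < '2-28-2019' excludes all of February 28 as well. Use datelocal < '2019-03-01' instead.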
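If the month is supplied as a parameter, the exclusive upper bound can be computed rather than typed by hand, which also handles leap years. A minimal sketch, using a literal where an application would pass its own value:

```sql
-- Half-open range [first day of month, first day of next month).
SELECT date_trunc('month', timestamp '2019-02-01')                      AS lower_bound
     , date_trunc('month', timestamp '2019-02-01') + interval '1 month' AS upper_bound;

-- Applied to the filter:
--   WHERE datelocal >= date_trunc('month', timestamp '2019-02-01')
--   AND   datelocal <  date_trunc('month', timestamp '2019-02-01') + interval '1 month'
```

Both bounds are constants as far as the planner is concerned, so the range condition stays usable with the index on datelocal.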
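If you can, it is cheaper to group and sort by the same expression as in your SELECT list. So use date_trunc() there, too. Don't use different expressions without need. If you need the date part in the result, apply it to the grouped expression, like: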
SELECT date_part('day', date_trunc('day', datelocal)) AS day
...
GROUP BY date_trunc('day', datelocal)
ORDER BY date_trunc('day', datelocal);
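The code is a bit noisier, but faster (and possibly easier for the query planner to optimize, too).
Use the aggregate FILTER clause in Postgres 9.4 or later. It's cleaner and a bit faster. See: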
- How can I simplify this game statistics query?
- For absolute performance, is SUM faster or COUNT?