WHERE clause is slower with value from CTE than with constant?

Question

I want to cache a variable for the duration of a query on Postgres 12. I followed the CTE approach shown below:

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars)
-- END PART 2

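Running the query above causes a performance problem. I expected the total runtime to be roughly (part1 runtime + part2 runtime), but it takes much longer.

Notably, there is no performance problem when I run only part 2 with a manually supplied datetime_threshold.

The locations table is defined as: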

 id | user_id | datetime | location | distance | ...
-----------------------------------------------------

Is there any way to reduce the total runtime to (part1 runtime + part2 runtime)?

Answer 1

If you want the query to perform well, I would suggest adding indexes on locations(user_id, distance) and locations(user_id, datetime).
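Spelled out as statements, the two suggested indexes might look like the sketch below. It assumes the column names from the question; index names are left for Postgres to generate.

```sql
-- Supports part 1: filter on user_id, then on distance
CREATE INDEX ON locations (user_id, distance);

-- Supports part 2: filter on user_id, then a range condition on datetime
CREATE INDEX ON locations (user_id, datetime);
```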

I would also phrase the query using window functions:

select l.*
from (select l.*,
             max(datetime) filter (where distance > 70) over (partition by user_id) as datetime_threshold
      from locations l
      where user_id = 9087
     ) l
where datetime > datetime_threshold;

Window functions often improve performance. That said, with the right indexes, I don't know whether the two versions would differ substantially.

Answer 2

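Split the query into two parts and store the result of the first part in a temporary table (a temporary table in PostgreSQL is only accessible within the current database session). Then reference the temp table in the second part. Hopefully that speeds things up.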

CREATE TEMPORARY TABLE temp_table_cached_vars (
    datetime_threshold timestamp
);

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
insert into temp_table_cached_vars
select datetime_threshold from cached_vars;
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from temp_table_cached_vars limit 1);
-- END PART 2

Answer 3

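The explanation behind the difference you observed is this:

Postgres has column statistics and can adapt the query plan to the value of a constant supplied for datetime_threshold. With a favorable filter value, this can lead to a much more efficient query plan.

In the other case, when datetime_threshold has to be computed in another SELECT first, Postgres has to fall back to a generic plan, since datetime_threshold could be anything.

The difference will become obvious in EXPLAIN output.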
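To see the difference, one could compare the plans of the two variants side by side. The following is only a sketch: the timestamp literal in variant B is a made-up placeholder, and in practice you would paste in the value that part 1 actually returns. Note that EXPLAIN with the ANALYZE option really executes the query.

```sql
-- Variant A: the threshold comes from the CTE, so its value is unknown at planning time
EXPLAIN (ANALYZE, BUFFERS)
WITH cached_vars AS (
    SELECT max(datetime) AS datetime_threshold
    FROM   locations
    WHERE  distance > 70
    AND    user_id = 9087
)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > (SELECT datetime_threshold FROM cached_vars);

-- Variant B: the threshold is supplied as a literal (placeholder value, an assumption)
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > '2020-01-01 00:00:00';
```

Comparing the estimated and actual row counts on the scan of locations in both plans should show whether the planner's estimate for the datetime condition is what changes the plan.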

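To make sure Postgres optimizes the second part for the actual datetime_threshold value, you can either run two separate queries (feeding the result of query 1 as a constant to query 2), or use dynamic SQL in a PL/pgSQL function to force re-planning of query 2 every time.

For example: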

CREATE OR REPLACE FUNCTION foo(_user_id int, _distance int = 70)
  RETURNS SETOF locations
  LANGUAGE plpgsql AS
$func$
BEGIN
   RETURN QUERY EXECUTE 
     'SELECT *
      FROM   locations
      WHERE  user_id = $1
      AND    datetime > $2'
   USING _user_id
      , (SELECT max(datetime)
         FROM   locations
         WHERE  distance > _distance
         AND    user_id = _user_id);
END
$func$;

Call:

SELECT * FROM foo(9087);
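The other option mentioned above, running two separate queries, can be sketched in psql with the \gset meta-command, which stores the columns of a single-row result in psql variables. This assumes the queries are run from psql; from application code you would instead read the value in the client and pass it into the second query.

```sql
-- Query 1: compute the threshold and store it in the psql variable datetime_threshold
SELECT max(datetime) AS datetime_threshold
FROM   locations
WHERE  distance > 70
AND    user_id = 9087
\gset

-- Query 2: psql interpolates the variable as a quoted literal,
-- so the planner sees a constant
SELECT *
FROM   locations
WHERE  user_id = 9087
AND    datetime > :'datetime_threshold';
```

One caveat: if query 1 returns NULL (no row with distance > 70 for that user), \gset leaves the variable unset and query 2 fails as written, so that case needs separate handling.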

Indexes

The perfect indexes would be:

CREATE INDEX ON locations (user_id, distance DESC NULLS LAST, datetime DESC NULLS LAST); -- for query 1
CREATE INDEX ON locations (user_id, datetime);           -- for query 2

Fine-tuning depends on undisclosed details. Partial indexes might be an option.
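As a rough illustration of the partial-index idea, and assuming the condition distance > 70 is fixed rather than variable, an index for query 1 could be restricted to the qualifying rows. The index name is hypothetical.

```sql
-- Assumption: "distance > 70" is a fixed condition in the threshold query.
-- Only rows satisfying it are indexed, which keeps the index small.
CREATE INDEX locations_user_datetime_far_idx
ON locations (user_id, datetime DESC NULLS LAST)
WHERE distance > 70;
```

Such an index is only considered for queries whose WHERE clause implies distance > 70, so it would not help if the distance limit varies freely between calls.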

There may be many other reasons why your query is slow. The question does not provide enough details.

Answer 4

Just add LIMIT 1 to the subquery, as in the example below.

-- BEGIN PART 1
with cached_vars as (
    select max(datetime) as datetime_threshold
    from locations
    where distance > 70
      and user_id = 9087
)
-- END PART 1
-- BEGIN PART 2
select *
from locations
where user_id = 9087
  and datetime > (select datetime_threshold from cached_vars Limit 1)
-- END PART 2
