How to improve SQL query performance containing partially common subqueries

Question

I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:

event_count     sys_timestamp

100             167877672772
110             167877672769
121             167877672987
111             167877673877
...             ...

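Both fields are defined as numeric.

With the help of answers from Stack Overflow, I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this: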

SELECT t1.*,

    (SELECT COUNT(*) FROM tableA t2 
        WHERE t2.sys_timestamp > t1.sys_timestamp AND 
        t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
        t2.event_count >= t1.event_count+10)
    AS positive, 

    (SELECT COUNT(*) FROM tableA t2 
       WHERE t2.sys_timestamp > t1.sys_timestamp AND 
       t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
       t2.event_count <= t1.event_count-10) 
    AS negative 

FROM tableA as t1

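The query works as expected and, in this particular example, returns for each row a count of positive and negative excesses within the defined time window (+ 1000 [milliseconds]).

However, I will have to run such queries on tables with several million (perhaps even more than 100 million) entries, and even with about 500k rows the query takes a very long time to complete. Furthermore, while the time frame always stays the same within a given query [the window size can change from query to query], in some cases I will have to use maybe 10 additional conditions similar to the positive/negative excesses in the same query.

So I am looking for ways to improve the above query, primarily for better performance given the size of the envisaged data set, and secondarily with more conditions in mind.

My concrete questions:

How can I reuse the common portion of the subquery to ensure that it is not executed twice (or several times), i.e. how can I reuse this within the query?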

 (SELECT COUNT(*) FROM tableA t2 
  WHERE t2.sys_timestamp >  t1.sys_timestamp
  AND   t2.sys_timestamp <= t1.sys_timestamp + 1000)

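Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field and trying to use any of the PostgreSQL window functions? (Unfortunately, I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query, aside from reusing the (partial) subquery, that materially increase the performance for large data sets?
Is it perhaps even faster to run these types of queries outside of the database, using something like Java, Scala, Python etc.?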

Answer 1

How can I reuse the common portion of the subquery ...?

Use conditional aggregation in a single LATERAL subquery:

SELECT t1.*, t2.positive, t2.negative
FROM   tableA t1
CROSS  JOIN LATERAL (
   SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
        , COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
   FROM   tableA t2 
   WHERE  t2.sys_timestamp >  t1.sys_timestamp
   AND    t2.sys_timestamp <= t1.sys_timestamp + 1000
   ) t2;

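It can be a CROSS JOIN because the subquery always returns a row.

Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See: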

Aggregate columns with additional (distinct) filters

event_count should be integer or bigint. See:

Is there any difference in saving same value in different integer types?

sys_timestamp should be timestamp or timestamptz. See:

Ignoring time zones altogether in Rails and PostgreSQL
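As a rough sketch of the two type changes suggested above, the columns could be converted in place. This assumes that sys_timestamp holds Unix epoch milliseconds and that event_count only holds whole numbers within the integer range; neither is confirmed by the question, so check the data first. Changing a column type rewrites the table and holds an exclusive lock while it runs.

```sql
-- Assumption: sys_timestamp is Unix epoch time in milliseconds,
-- and event_count contains only whole numbers that fit into integer.
ALTER TABLE tableA
   ALTER COLUMN event_count   TYPE integer
      USING event_count::integer
 , ALTER COLUMN sys_timestamp TYPE timestamptz
      USING to_timestamp((sys_timestamp / 1000.0)::double precision);

-- After the conversion the range predicate adds an interval, not a number:
--    WHERE  t2.sys_timestamp >  t1.sys_timestamp
--    AND    t2.sys_timestamp <= t1.sys_timestamp + interval '1 second'
```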

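An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.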
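A minimal sketch of the multicolumn index, using the table and column names from the question. The index name is made up for illustration.

```sql
-- Hypothetical index name; leading column matches the range predicate.
CREATE INDEX tablea_ts_count_idx ON tableA (sys_timestamp, event_count);

-- Update the visibility map and planner statistics,
-- which index-only scans depend on.
VACUUM ANALYZE tableA;
```

With sys_timestamp as the leading column, the range condition on the timestamp can narrow the scan, and event_count can be read from the index without visiting the table.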

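Depending on the exact data distribution (most importantly how much the time frames overlap) and other DB characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language. But a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See: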

Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application

Answer 2

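You have the right idea. The way to write statements you can reuse in a query is the "with" statement (also known as subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or by the final query.

The first step is to create parent-child detail rows: the table multiplied by itself and filtered by the timestamp.

The next step is to reuse that same detail query for everything else.

Assuming that event_count is a primary index, or that you have a compound index on event_count and sys_timestamp, this would look like: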

with baseQuery as
(
   -- Pair every row (t1) with the later rows (t2) inside its time window.
   SELECT distinct t1.event_count as startEventCount, t1.event_count + 10 as pEndEventCount
   , t1.event_count - 10 as nEndEventCount, t2.event_count as t2EventCount
   FROM tableA t1, tableA t2
   where t2.sys_timestamp >  t1.sys_timestamp
   and   t2.sys_timestamp <= t1.sys_timestamp + 1000
), posSummary as
(
   -- Events at least 10 above the starting count, as in the question.
   select bq.startEventCount, count(*) as positive
   from baseQuery bq
   where bq.t2EventCount >= bq.pEndEventCount
   group by bq.startEventCount
), negSummary as
(
   -- Events at least 10 below the starting count, as in the question.
   select bq.startEventCount, count(*) as negative
   from baseQuery bq
   where bq.t2EventCount <= bq.nEndEventCount
   group by bq.startEventCount
)
-- Inner joins drop rows that have no positive or no negative excess.
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count = ps.startEventCount
inner join negSummary ns on t1.event_count = ns.startEventCount;

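Notes:

Depending on your actual keys, the distinct in baseQuery may not be necessary.
The final join is done with tableA, but it could also use a summary of baseQuery as a separate "with" statement, which has already been run once. It seemed unnecessary.

You can play around and see what works.

There are other ways, of course, but this best illustrates how and where things could be improved.

With statements are used in multi-dimensional data warehouse queries because, when you have so much data to join with so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed, and perhaps how to minimize the rows the query has to process further until it completes. For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check the explain plan), your query improves overall.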
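One way to check the plan for the self-join on its own is sketched below, using the table and columns from the question. Note that EXPLAIN with the ANALYZE option actually executes the statement, so it is safer to try it on a sample of the data first.

```sql
-- Runs the self-join and reports the actual plan, timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   tableA t1
JOIN   tableA t2 ON t2.sys_timestamp >  t1.sys_timestamp
                AND t2.sys_timestamp <= t1.sys_timestamp + 1000;
```

In the output, a sequential scan on the inner side of the join for every outer row would suggest that an index on sys_timestamp is missing or not being used.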
