Count rows in partition with Order By

Question

I am trying to understand PARTITION BY in postgres by writing a few sample queries. I have a test table on which I run my queries.

id integer | num integer
___________|_____________
1          | 4 
2          | 4
3          | 5
4          | 6

When I run the following query, I get the output I expect.

SELECT id, COUNT(id) OVER(PARTITION BY num) from test;

id         | count
___________|_____________
1          | 2 
2          | 2
3          | 1
4          | 1

But when I add ORDER BY to the partition,

SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;

id         | count
___________|_____________
1          | 1 
2          | 2
3          | 1
4          | 1

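My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without an ORDER BY clause. Why is there a difference in the outputs?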

Answer 1

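When you add an order by to an aggregate used as a window function, that aggregate turns into a "running count" (or a running version of whatever aggregate you use).

count(*) will return the number of rows up to the "current one", based on the order specified.

The following query shows the different results for aggregates used with an order by. With sum() instead of count() it is a bit easier to see (in my opinion).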

with test (id, num, x) as (
  values 
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id, 
       num,
       x,
       count(*) over () as total_rows, 
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;

will result in:

id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
 1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
 2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
 3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
 4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11

There are more examples in the Postgres manual.

Answer 2

Your two expressions are:

COUNT(id) OVER (PARTITION BY num)

COUNT(id) OVER (PARTITION BY num ORDER BY id)

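Why would you expect them to return the same values? The syntax is different for a reason.

The first returns the overall count for each num, essentially joining the aggregated value back onto every row.

The second does a cumulative count. For each row, it performs the COUNT() over all rows in the partition whose id is less than or equal to that row's id.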
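As a minimal sketch, the two expressions can be put side by side on the question's data. The inline `test` CTE and the column aliases `total_per_num` and `running_per_num` are assumptions made for illustration; they stand in for the real table.

```sql
-- Inline stand-in for the question's test table (illustrative only).
WITH test (id, num) AS (
  VALUES (1, 4), (2, 4), (3, 5), (4, 6)
)
SELECT id,
       num,
       -- No ORDER BY: the frame is the whole partition.
       COUNT(id) OVER (PARTITION BY num)             AS total_per_num,
       -- With ORDER BY: the frame ends at the current row (and its peers).
       COUNT(id) OVER (PARTITION BY num ORDER BY id) AS running_per_num
FROM test
ORDER BY id;
```

The `total_per_num` column should reproduce the question's first result (2, 2, 1, 1) and `running_per_num` the second (1, 2, 1, 1).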

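Note that such a cumulative count would usually be implemented using RANK() (or a related function). The cumulative count is subtly different from RANK(). With ORDER BY and no explicit frame, the cumulative count is equivalent to: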

COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
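A small sketch of that difference, using made-up data with a tie on the ordering key. The CTE `t` and its columns `grp` and `k` are hypothetical and not part of the question.

```sql
-- Hypothetical data: the key 20 appears twice within the same partition.
WITH t (grp, k) AS (
  VALUES (1, 10), (1, 20), (1, 20), (1, 30)
)
SELECT k,
       -- RANGE treats tied rows as peers, so both k = 20 rows count 3 rows.
       COUNT(*) OVER (PARTITION BY grp ORDER BY k
                      RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_count,
       -- RANK() gives tied rows the rank of the first of them.
       RANK() OVER (PARTITION BY grp ORDER BY k) AS rnk
FROM t
ORDER BY k;
-- k  | cumulative_count | rnk
-- 10 |                1 |   1
-- 20 |                3 |   2
-- 20 |                3 |   2
-- 30 |                4 |   4
```

Without the duplicate key, the two columns would agree on every row.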

Answer 3

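Others have already explained the "why". Sometimes you have an ordered window but still need the count over the whole partition, despite the ORDER BY.

To do that, use an unbounded frame: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING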

create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null
);

-- The window is ordered newest first, so first_value(created) is the latest sighting;
-- the unbounded frame makes count(*) cover the whole partition.
SELECT query,
       created as seen_on,
       first_value(created) OVER query_window as last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurence
FROM search_log l
     WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC 
         RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
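The same idea can be tried on the question's data without creating a table. This is only a sketch: the inline `test` CTE and the aliases `total_per_num` and `rn` are assumptions for illustration.

```sql
-- Inline stand-in for the question's test table (illustrative only).
WITH test (id, num) AS (
  VALUES (1, 4), (2, 4), (3, 5), (4, 6)
)
SELECT id,
       num,
       -- The explicit unbounded frame covers the whole partition despite the ORDER BY.
       COUNT(id) OVER (PARTITION BY num ORDER BY id
                       RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS total_per_num,
       -- row_number() ignores the frame and still follows the ordering.
       ROW_NUMBER() OVER (PARTITION BY num ORDER BY id) AS rn
FROM test
ORDER BY id;
```

Here `total_per_num` should again be 2, 2, 1, 1, while `rn` still numbers the rows within each partition (1, 2, 1, 1).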

sql

postgresql

window-functions