Count rows in partition with Order By

I am trying to understand PARTITION BY in Postgres by writing a few sample queries. I have a test table that I run my queries against.

id integer | num integer
___________|_____________
1          | 4 
2          | 4
3          | 5
4          | 6

When I run the following query, I get the expected output.

SELECT id, COUNT(id) OVER(PARTITION BY num) from test;

id         | count
___________|_____________
1          | 2 
2          | 2
3          | 1
4          | 1

However, when I add ORDER BY to the partition,

SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;

id         | count
___________|_____________
1          | 1 
2          | 2
3          | 1
4          | 1

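My understanding is that COUNT is computed across all rows that fall into a partition. Here, I have partitioned the rows by num. The number of rows in the partition is the same, with or without the ORDER BY clause. Why is there a difference in the outputs?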

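When you add an order by to an aggregate used as a window function, that aggregate turns into a "running count" (or a running version of whatever aggregate you use).

The count(*) will return the number of rows up to and including the "current one", based on the order specified.

The following query shows the different results for aggregates used with and without an order by. With sum() instead of count() it is a bit easier to see (in my opinion).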

with test (id, num, x) as (
  values 
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id, 
       num,
       x,
       count(*) over () as total_rows, 
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;

This results in:

id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
 1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
 2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
 3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
 4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11

There are more examples in the Postgres manual.

Your two expressions are:

COUNT(id) OVER (PARTITION BY num)

COUNT(id) OVER (PARTITION BY num ORDER BY id)

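Why would you expect them to return the same values? The syntax is different for a reason.

The first returns the overall count for each num, essentially joining the aggregated value back onto every row.

The second does a cumulative count. For each row, it performs the COUNT() over all rows in the same partition whose id is less than or equal to the id of that row.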
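To see both behaviours side by side, here is a minimal sketch that rebuilds the test table from the question inline with a VALUES list. The column aliases count_per_num and running_count are made up for illustration.

```sql
-- Rebuild the asker's test table inline (values copied from the question).
WITH test (id, num) AS (
  VALUES (1, 4), (2, 4), (3, 5), (4, 6)
)
SELECT id,
       num,
       -- No ORDER BY: the frame is the whole partition.
       COUNT(id) OVER (PARTITION BY num)             AS count_per_num,
       -- With ORDER BY: the frame ends at the current row (and its peers).
       COUNT(id) OVER (PARTITION BY num ORDER BY id) AS running_count
FROM test
ORDER BY id;

--  id | num | count_per_num | running_count
-- ----+-----+---------------+---------------
--   1 |   4 |             2 |             1
--   2 |   4 |             2 |             2
--   3 |   5 |             1 |             1
--   4 |   6 |             1 |             1
```

The two count columns reproduce the two result sets shown in the question.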

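Note that a cumulative count like this is often implemented using RANK() (or a related function). A cumulative count is subtly different from RANK(), though. With only an ORDER BY, the cumulative count uses the default window frame, so it is equivalent to: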

COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

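RANK() is slightly different. The difference only matters when the ORDER BY keys have ties: the RANGE frame includes every peer of the current row, so tied rows all get the count up to the last of them, whereas RANK() gives tied rows the position of the first of them.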
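A small sketch of the tie case, using made-up data: the table t and its columns grp and score are hypothetical and not part of the question.

```sql
-- Hypothetical data with a tie on the ORDER BY key (score = 20 appears twice).
WITH t (grp, score) AS (
  VALUES ('a', 10), ('a', 20), ('a', 20), ('a', 30)
)
SELECT score,
       -- Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
       -- which includes all peers of the current row.
       COUNT(*) OVER (PARTITION BY grp ORDER BY score) AS running_count,
       RANK()   OVER (PARTITION BY grp ORDER BY score) AS rnk
FROM t
ORDER BY score;

--  score | running_count | rnk
-- -------+---------------+-----
--     10 |             1 |   1
--     20 |             3 |   2
--     20 |             3 |   2
--     30 |             4 |   4
```

Without ties the two columns would be identical, which is why they are easy to confuse.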

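The "why" has already been explained by others. Sometimes you have an ordered window and still need the count over the whole partition, despite the ORDER BY.

For that, use an unbounded frame: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING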

create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null
);

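-- For each search, show when its stemmed query was last seen and how often it occurs overall.
SELECT query,
       created AS seen_on,
       first_value(created) OVER query_window AS last_seen,  -- newest row comes first (ORDER BY created DESC)
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurrences              -- whole partition, because the frame is unbounded
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
                        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);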
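
Applied to the test table from the question, the same frame clause might look like the sketch below. The window name w and the aliases are invented for illustration, and the table is rebuilt inline from the question's values.

```sql
-- Rebuild the asker's test table inline (values copied from the question).
WITH test (id, num) AS (
  VALUES (1, 4), (2, 4), (3, 5), (4, 6)
)
SELECT id,
       COUNT(id)    OVER w AS count_per_num,  -- counts the whole partition
       ROW_NUMBER() OVER w AS rn              -- still uses the ORDER BY
FROM test
WINDOW w AS (PARTITION BY num ORDER BY id
             RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY id;

--  id | count_per_num | rn
-- ----+---------------+----
--   1 |             2 |  1
--   2 |             2 |  2
--   3 |             1 |  1
--   4 |             1 |  1
```

One window definition can then serve both order-dependent functions such as ROW_NUMBER() and whole-partition aggregates such as COUNT().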