Count rows in partition with ORDER BY
I am trying to understand PARTITION BY in Postgres by writing a few sample queries. I have a test table that I run my queries on.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the expected output.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
However, when I add ORDER BY to the partition, the output changes:
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
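My understanding is that COUNT is computed over all rows that belong to a partition. Here I have partitioned the rows by num. The number of rows in each partition is the same with or without the ORDER BY clause. Why do the outputs differ?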
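When you add an ORDER BY to an aggregate used as a window function, the aggregate turns into a "running count" (or a running version of whatever aggregate you use).
count(*) will then return the number of rows up to and including the "current one", based on the order specified.
The following query shows the different results for aggregates used with and without an ORDER BY. With sum() instead of count() the effect is a bit easier to see (in my opinion).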
with test (id, num, x) as (
values
(1, 4, 1),
(2, 4, 1),
(3, 5, 2),
(4, 6, 2)
)
select id,
num,
x,
count(*) over () as total_rows,
count(*) over (order by id) as rows_upto,
count(*) over (partition by x order by id) as rows_per_x,
sum(num) over (partition by x) as total_for_x,
sum(num) over (order by id) as sum_upto,
sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
This results in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
1 | 4 | 1 | 4 | 1 | 1 | 8 | 4 | 4
2 | 4 | 1 | 4 | 2 | 2 | 8 | 8 | 8
3 | 5 | 2 | 4 | 3 | 1 | 11 | 13 | 5
4 | 6 | 2 | 4 | 4 | 2 | 11 | 19 | 11
More examples can be found in the PostgreSQL documentation on window functions.
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
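Why would you expect them to return the same value? The syntax is different for a reason.
The first returns the total count for each num, essentially joining the aggregated value back to every row.
The second does a cumulative count: for each row, it performs COUNT() over all rows in the partition whose id is less than or equal to the current row's id.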
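To make the "joining the aggregated value back" description concrete, here is a minimal sketch that computes the per-num count twice: once with GROUP BY and a join, once with the window function. The inline test data mirrors the question's table; the CTE name per_num and the column aliases are made up for illustration.

```sql
-- Sample data copied from the question's table.
WITH test (id, num) AS (
    VALUES (1, 4), (2, 4), (3, 5), (4, 6)
),
per_num AS (
    -- One row per num holding the total count for that num.
    SELECT num, COUNT(id) AS cnt
    FROM test
    GROUP BY num
)
SELECT t.id,
       p.cnt                                 AS count_via_join,
       COUNT(t.id) OVER (PARTITION BY t.num) AS count_via_window
FROM test t
JOIN per_num p ON p.num = t.num
ORDER BY t.id;
```

Both count columns should come out as 2, 2, 1, 1 for ids 1 through 4, matching the output of the first query in the question.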
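Note that such a cumulative count is often implemented with RANK() (or a related function) instead.
A cumulative count is subtly different from RANK(), though. The cumulative count is equivalent to:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
The difference only matters when the ORDER BY keys have ties.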
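A small sketch of the tie case, assuming a variant of the question's data in which id 1 appears twice within num = 4 (this data is invented for illustration and is not from the question):

```sql
-- Hypothetical data: two rows tie on id within the num = 4 partition.
WITH test (id, num) AS (
    VALUES (1, 4), (1, 4), (2, 4), (3, 5)
)
SELECT id,
       num,
       -- Default frame with ORDER BY: RANGE ... CURRENT ROW, which includes peers.
       COUNT(id) OVER (PARTITION BY num ORDER BY id) AS running_count,
       -- RANK() gives tied rows the rank of the first row in the tie.
       RANK()    OVER (PARTITION BY num ORDER BY id) AS rnk
FROM test
ORDER BY num, id;
```

For the two tied rows the running count is 2 (the RANGE frame includes all peers of the current row) while RANK() is 1. For the row with id 2 both columns are 3. If a strictly row-by-row count is wanted, a ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame or ROW_NUMBER() would give 1, 2, 3 instead, with the order among tied rows being arbitrary.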
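The "why" has already been explained by others. Sometimes you have an ordered window and still need a count over the whole partition, despite the ORDER BY.
To do that, use an unbounded frame: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING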
create table search_log
(
id bigint not null primary key,
query varchar(255) not null,
stemmed_query varchar(255) not null,
created timestamp not null
);
SELECT query,
created as seen_on,
first_value(created) OVER query_window as last_seen,
row_number() OVER query_window AS rn,
count(*) OVER query_window AS occurrence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
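Applied to the table from the question, the same frame clause could look like the sketch below. The inline test data is copied from the question; the column aliases are made up for illustration.

```sql
-- Sample data copied from the question's table.
WITH test (id, num) AS (
    VALUES (1, 4), (2, 4), (3, 5), (4, 6)
)
SELECT id,
       -- Default frame: counts rows up to and including the current row.
       COUNT(id) OVER (PARTITION BY num ORDER BY id) AS running_count,
       -- Explicit full-partition frame: ORDER BY no longer limits the count.
       COUNT(id) OVER (PARTITION BY num ORDER BY id
                       RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS partition_count
FROM test
ORDER BY id;
```

Here running_count should be 1, 2, 1, 1 (the output the question found surprising) and partition_count should be 2, 2, 1, 1 (the same as the query without ORDER BY).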