Window function to select the name of the majority row in Redshift Postgres

I have a dataset like this, where some rows are useful and others are corrupted.

create table pages (
  page varchar,
  cat varchar,
  hits int
);

insert into pages values
(1, 'asdf', 1),
(1, 'fdsa', 2),
(1, 'Apples', 321),
(2, 'gwegr', 30),
(2, 'hsgsdf', 2),
(2, 'Bananas', 321);

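I'd like to know the correct category and the total number of hits for each page. The correct category is the one with the most hits. I'd like a result set like this: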

page | category | sum_of_hits
-----------------------------
1    | Apples   |  324
2    | Bananas  |  353

The furthest I've been able to get is:

SELECT page,
       last_value(cat) over (partition BY page ORDER BY hits) as category,
       sum(hits) as sum_of_hits
FROM pages
GROUP BY 1, 2

But this fails with: ERROR: column "pages.hits" must appear in the GROUP BY clause or be used in an aggregate function Position: 83.

I tried aggregating the hits in the window ordering, ORDER BY max(hits), but that doesn't make sense and isn't what I want.

Fiddle: http://sqlfiddle.com/#!17/cb3c2/17

Use a subquery:

select page, cat, hits from
    (select page, cat, hits 
     ,max(hits) over (partition by page) as m_hits 
     from pages) t
where m_hits = hits
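
Note that the query above returns the hits of the top row itself (321), not the per-page total the question asks for. A minimal sketch of one way to get the total as well, assuming the same pages table; the sum_of_hits alias and the extra window aggregate are additions for illustration:

```sql
-- Sketch: keep the max(hits) filter, and carry a per-page total alongside it.
-- sum_of_hits is an illustrative alias, not part of the original query.
select page, cat as category, sum_of_hits
from (select page, cat, hits,
             max(hits) over (partition by page) as m_hits,
             sum(hits) over (partition by page) as sum_of_hits
      from pages) t
where hits = m_hits   -- keep only the row(s) with the highest hits per page
order by page;
```

With the sample data this gives Apples with 324 for page 1 and Bananas with 353 for page 2. If two categories tie for the highest hits on a page, both rows are returned.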

Use the window function first_value() with hits in descending order, inside a derived table (a subquery in the FROM clause):

select 
    page, 
    category,
    sum(hits) as sum_of_hits
from (
    select
        page,
        first_value(cat) over (partition by page order by hits desc) as category,
        hits
    from pages
    ) s
group by 1, 2
order by 1;

SqlFiddle.
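
Since the question also mentions Redshift: as far as I know, Redshift rejects first_value() and last_value() with an ORDER BY unless the window has an explicit frame clause. A sketch of the same query with the frame spelled out, which should also run unchanged on Postgres (treat the Redshift behaviour as an assumption to verify on your cluster):

```sql
-- Sketch: same derived-table approach, with an explicit whole-partition frame.
select page,
       category,
       sum(hits) as sum_of_hits
from (
    select page,
           first_value(cat) over (
               partition by page
               order by hits desc
               rows between unbounded preceding and unbounded following
           ) as category,
           hits
    from pages
) s
group by 1, 2
order by 1;
```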

There are two problems here:

The first is the use of last_value. Read the note about last_value in the Postgres documentation:

Note that first_value, last_value, and nth_value consider only the rows within the "window frame", which by default contains the rows from the start of the partition through the last peer of the current row. This is likely to give unhelpful results for nth_value and particularly last_value. You can redefine the frame as being the whole partition by adding ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the OVER clause. See Section 4.2.8 for more information.
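
For illustration, this is roughly what the frame redefinition described in that note looks like when applied to the window from the question. It is only a sketch against the pages table above:

```sql
-- Sketch: last_value over the whole partition instead of the default frame.
-- Without the ROWS clause, the frame ends at the current row's last peer,
-- so last_value(cat) would mostly return the current row's own cat.
select page,
       last_value(cat) over (
           partition by page
           order by hits
           rows between unbounded preceding and unbounded following
       ) as category,
       hits
from pages;
```

Every row of a page now carries the category of that page's highest-hits row.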

I suggest switching to first_value with a descending sort instead:

SELECT page,
       first_value(cat) over (partition BY page ORDER BY hits DESC) as category,
       hits
FROM pages

The second problem is that you can't use a window function directly in the GROUP BY clause; you need a subquery or a CTE:

select page, category,
       sum(hits)
from (
SELECT page,
       first_value(cat) over (partition BY page ORDER BY hits DESC) as category,
       hits
FROM pages
) a
GROUP BY 1, 2

SQL Fiddle
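
The CTE form mentioned above would look something like this. It is the same logic as the subquery version; the name labeled is made up for the example:

```sql
-- Sketch: the window function runs inside the CTE,
-- and the outer query groups by its result.
with labeled as (
    select page,
           first_value(cat) over (partition by page order by hits desc) as category,
           hits
    from pages
)
select page,
       category,
       sum(hits) as sum_of_hits
from labeled
group by page, category
order by page;
```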

You seem to want the total hits per page together with the category that has the most hits. This is simple:

select page, sum(hits) as total_hits,
       max(case when seqnum = 1 then cat end) as category
from (select p.*,
             row_number() over (partition by page order by hits desc) as seqnum
      from pages p
     ) p
group by page;
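
One caveat with row_number(): if two categories on a page have the same highest hits, which one gets seqnum = 1 is arbitrary. A sketch that makes the pick deterministic by adding a secondary sort key; breaking ties alphabetically on cat is an assumption for illustration, not something the question specifies:

```sql
-- Sketch: tie-break on cat so the chosen category is stable across runs.
select page,
       sum(hits) as total_hits,
       max(case when seqnum = 1 then cat end) as category
from (select p.*,
             row_number() over (partition by page
                                order by hits desc, cat) as seqnum
      from pages p
     ) p
group by page;
```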