Identify number of rows by id that have sequential values in another column, with the greatest of those values equaling a specific value
https://www.db-fiddle.com/f/2bzoKxbU2gznwwmQpMmjp5/0
(The actual database is Microsoft SQL Server 2014.)
The fiddle above shows what I'm trying to do.
CREATE TABLE IF NOT EXISTS table1 (
id nvarchar(5) NOT NULL,
year int(4) NOT NULL,
PRIMARY KEY (id,year)
);
INSERT INTO table1 (id, year) VALUES
('A', '2013'),
('A', '2014'),
('A', '2017'),
('A', '2018'),
('B', '2016'),
('B', '2017'),
('B', '2018'),
('C', '2016'),
('D', '2014'),
('D', '2016'),
('D', '2018');
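This is roughly the data I'm working with. For each id that has '2018' in the year column, I want the number of consecutive (sequential) records in the run that includes 2018.
My thought process so far: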
select id, count(*)
from table1
group by id;
select main.id,
case when in_2018.id is not null
then count(*)
else 0
end
from table1 as main
left join table1 as in_2018
on in_2018.id = main.id
and
in_2018.year = 2018
group by main.id;
/*
Want a table:
A | 2
B | 3
C | 0
D | 1
Count of records that are in a single-step incremental that include 2018 by id
*/
Obviously these don't return the consecutive rows, only counts for the ids that meet the '2018' criterion.
I also tried a different check:
case when count(*) = max(year) - min(year) +1,
On my sample data this works for id B, because all of B's years are sequential, but it doesn't handle the broken patterns of the other ids.
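To make the limitation concrete, here is a small Python sketch of that check applied to the sample data. The dictionary layout and the names `data` and `fully_sequential` are only illustrative assumptions, not part of the SQL above.

```python
# Sample data from the question: id -> years.
data = {
    "A": [2013, 2014, 2017, 2018],
    "B": [2016, 2017, 2018],
    "C": [2016],
    "D": [2014, 2016, 2018],
}

# Same test as: count(*) = max(year) - min(year) + 1
# It only says whether ALL years of an id form one unbroken run.
fully_sequential = {
    id_: len(years) == max(years) - min(years) + 1
    for id_, years in data.items()
}

for id_, ok in fully_sequential.items():
    print(id_, ok)
```

The check is all-or-nothing per id: A has 4 rows spread over a 6-year range, so it fails, even though 2017-2018 is a valid run of 2.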
In SQL Server, you can solve this with row_number():
select top (1) id, count(*)
from (select t.*, row_number() over (partition by id order by year) as seqnum
from table1 t
) t
group by id, (year - seqnum)
having sum(case when year = 2018 then 1 else 0 end) > 0
order by count(*) desc;
This relies on the observation that year - seqnum is constant as long as the years form a consecutive sequence.
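A small Python sketch of that observation, using the years of id A from the sample data. The enumeration stands in for row_number(), and the names `rows` and `runs` are only illustrative.

```python
from itertools import groupby

# Years for id 'A' from the sample data, in ascending order.
years = [2013, 2014, 2017, 2018]

# seqnum mimics row_number() over (partition by id order by year).
rows = [(year, seqnum, year - seqnum)
        for seqnum, year in enumerate(years, start=1)]

# Within a run of consecutive years, year and seqnum both grow by 1,
# so (year - seqnum) stays the same and can serve as the run's label.
runs = {
    diff: [year for year, _, _ in group]
    for diff, group in groupby(rows, key=lambda r: r[2])
}

for year, seqnum, diff in rows:
    print(year, seqnum, diff)
print(runs)
```

Grouping by that difference splits A into the runs 2013-2014 and 2017-2018; the having clause then keeps only the run that contains 2018.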
In databases that don't support window functions, the simplest solution is probably a correlated subquery that does the same calculation:
select id, count(*)
from (select t.*,
(select count(*)
from table1 tt
where tt.id = t.id and tt.year <= t.year
) as seqnum
from table1 t
) t
group by id, (year - seqnum)
having sum(case when year = 2018 then 1 else 0 end) > 0
order by count(*) desc
fetch first 1 row only;
Here is a db<>fiddle.
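The queries above stop at the single longest run, while the question asks for one row per id, with 0 for ids that have no 2018. One possible way to get that table is sketched below with SQLite through Python's sqlite3 module, used here only as a stand-in for the real database (an assumption). The left join, the coalesce and the aliases `ids`, `runs` and `n` are additions that do not appear in the original queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table1 (id text not null, year int not null, "
    "primary key (id, year))"
)
conn.executemany(
    "insert into table1 (id, year) values (?, ?)",
    [("A", 2013), ("A", 2014), ("A", 2017), ("A", 2018),
     ("B", 2016), ("B", 2017), ("B", 2018),
     ("C", 2016),
     ("D", 2014), ("D", 2016), ("D", 2018)],
)

# Inner part: the correlated-subquery version of the year - seqnum trick,
# without any row limit. Outer part: left join from the list of all ids so
# that ids without a run containing 2018 still show up, with a count of 0.
query = """
select ids.id, coalesce(runs.n, 0) as n
from (select distinct id from table1) ids
left join (
    select id, count(*) as n
    from (select t.id, t.year,
                 (select count(*)
                  from table1 tt
                  where tt.id = t.id and tt.year <= t.year
                 ) as seqnum
          from table1 t
         ) s
    group by id, (year - seqnum)
    having sum(case when year = 2018 then 1 else 0 end) > 0
) runs on runs.id = ids.id
order by ids.id
"""
result = conn.execute(query).fetchall()
print(result)
```

Because (id, year) is the primary key, at most one run per id can contain 2018, so the join never duplicates an id.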
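I see Gordon beat me to it, and with a much shorter query. But I had already come this far, so I'm posting anyway. I think the general idea is more or less the same, but mine doesn't rely on any non-standard features (I think), and I hope the comments I added make up for the extra length. ;-)
Each subquery can also be run on its own, so you can see how the result is 'zoomed in' step by step.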
select
id,
max(span) as nr_of_years
from
( -- This inner query gives all the valid ranges, but they have to be deduplicated.
-- For instance, it can give B 2017-2018 while there is also B 2016-2018, which takes precedence.
-- That's why the outer query uses max, to get the longest range.
select
s.id,
s.year,
s.otheryear,
s.span,
s.rows_in_span
from
( -- Find all possible 'spans' of years between two rows with the same id.
-- Also find how many rows are in that span. They should match.
select
a.id,
a.year,
b.year as otheryear,
a.year - b.year + 1 as span,
( select count(*) from table1 c
where
c.id = a.id and
c.year >= b.year and
c.year <= a.year) as rows_in_span
from
table1 a
join table1 b on b.ID = a.ID and b.year <= a.year -- like a cross join, but per ID
) s
where
-- If they are not equal, at least one year is missing between the lowest and highest year in the span
s.span = s.rows_in_span and
-- The span must actually contain 2018 (lowest year <= 2018 <= highest year); otherwise it is a range, but out of scope
s.otheryear <= 2018 and s.year >= 2018
) f
group by
f.id
In the fiddle you can see that it also works on Postgres (you can switch between databases there; I simplified the create statement to allow that).
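To see the span = rows_in_span filter at work, the innermost subquery can be run by itself for a single id. The sketch below does that for id D, again using SQLite through Python's sqlite3 module as a stand-in database (an assumption); the `where a.id = 'D'` filter, the ordering and the Python names are only there for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table1 (id text not null, year int not null, "
    "primary key (id, year))"
)
conn.executemany(
    "insert into table1 (id, year) values (?, ?)",
    [("D", 2014), ("D", 2016), ("D", 2018)],
)

# Innermost subquery of the answer, restricted to one id:
# every pair (a.year, b.year) with b.year <= a.year is a candidate span.
query = """
select a.year, b.year as otheryear,
       a.year - b.year + 1 as span,
       (select count(*) from table1 c
        where c.id = a.id and c.year >= b.year and c.year <= a.year
       ) as rows_in_span
from table1 a
join table1 b on b.id = a.id and b.year <= a.year
where a.id = 'D'
order by a.year, b.year
"""
rows = conn.execute(query).fetchall()

# A span is gap-free only when it holds exactly one row per year.
valid = [r for r in rows if r[2] == r[3]]

for r in rows:
    print(r, "ok" if r in valid else "gap")
```

For D, only the single-year spans survive, for example 2014-2018 covers 5 years but holds just 3 rows, so the longest gap-free span ending in 2018 has length 1.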