Postgres where clause over two columns

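Database: I am using Postgres 9.6.5.

I am analyzing flight arrival and departure data from the US airport authority (RITA). This link (http://stat-computing.org/dataexpo/2009/the-data.html) lists all the columns in the table.

The table has the following 29 columns: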

    No  Name               Description
    1   Year               1987-2008
    2   Month              1-12
    3   DayofMonth         1-31
    4   DayOfWeek          1 (Monday) - 7 (Sunday)
    5   DepTime            actual departure time (local, hhmm)
    6   CRSDepTime         scheduled departure time (local, hhmm)
    7   ArrTime            actual arrival time (local, hhmm)
    8   CRSArrTime         scheduled arrival time (local, hhmm)
    9   UniqueCarrier      unique carrier code
    10  FlightNum          flight number
    11  TailNum            plane tail number
    12  ActualElapsedTime  in minutes
    13  CRSElapsedTime     in minutes
    14  AirTime            in minutes
    15  ArrDelay           arrival delay, in minutes
    16  DepDelay           departure delay, in minutes
    17  Origin             origin IATA airport code
    18  Dest               destination IATA airport code
    19  Distance           in miles
    20  TaxiIn             taxi in time, in minutes
    21  TaxiOut            taxi out time, in minutes
    22  Cancelled          was the flight cancelled?
    23  CancellationCode   reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
    24  Diverted           1 = yes, 0 = no
    25  CarrierDelay       in minutes
    26  WeatherDelay       in minutes
    27  NASDelay           in minutes
    28  SecurityDelay      in minutes
    29  LateAircraftDelay  in minutes

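There are about a million rows for each year.

I want to find the busiest airports, counting only departures that were delayed by more than 15 minutes. The DepDelay column holds the departure delay, and Origin is the origin airport code.

All the data has been loaded into a table named 'ontime'.

I am building the query in the following stages.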

  1. Select the airports with delays of more than 15 minutes

    select origin,year,count(*) as depdelay_count from ontime where depdelay > 15
    group by year,origin order by depdelay_count desc

  2. Now I want to pull out only the top 10 airports for each year, which I am doing as follows

    select x.origin,x.year from (with subquery as (
        select origin,year,count(*) as depdelay_count from ontime
        where depdelay > 15
        group by year,origin
        order by depdelay_count desc
        )
        select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 10;

  3. Now that I have the top 10 airports by depdelay, I want to count the total number of flights out of these airports.

    select origin,count(*) from ontime where origin in
    (select x.origin from (with subquery as (
        select origin,year,count(*) as depdelay_count from ontime
        where depdelay > 15
        group by year,origin
        order by depdelay_count desc
        )
        select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2)
    group by origin
    order by origin;

I can modify the step 3 query by adding the year to the where clause, where <YEAR> is any value from 1987 to 2008:

    select origin,count(*) from ontime where year = (<YEAR>) and origin in
    (select x.origin from (with subquery as (
        select origin,year,count(*) as depdelay_count from ontime
        where
        depdelay > 15
        group by year,origin
        order by depdelay_count desc
        )
        select origin,year,rank() over (partition by year order by depdelay_count desc) as rank from subquery) x where x.rank <= 2)
    group by origin
    order by origin;

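But then I have to run this manually for every year from 1987 to 2008, which I want to avoid.

Could you please help refine the query so that I can select the data for all years at once, without having to select each year manually?

I find the CTE in the middle of the query confusing. You can basically do this with one CTE/subquery: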

with oy as (
      select origin, year, count(*) as numflights,
             sum( (depdelay > 15)::int ) as depdelay_count,
             row_number() over (partition by year order by sum( (depdelay > 15)::int ) desc) as seqnum
      from ontime
      group by origin, year
     ) 
select oy.*
from oy
where seqnum <= 10;

Note the use of conditional aggregation and the use of window functions with aggregation functions.
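To see both ideas at work, here is a minimal sketch on a tiny made-up table. The name ontime_sample and its ten rows are hypothetical and only stand in for the real ontime table. It also swaps the sum( (...)::int ) form for the aggregate FILTER clause (available since Postgres 9.4), which is an alternative spelling rather than what the query above uses. One difference worth knowing: if every depdelay for an airport and year is null, sum() returns null and "order by ... desc" sorts nulls first by default, whereas count(*) filter (...) returns 0 in that case.

```sql
-- Hypothetical sample data; the real ontime table has 29 columns.
create temp table ontime_sample (
    year     int,
    origin   text,
    depdelay int   -- null when the departure delay is unknown
);

insert into ontime_sample (year, origin, depdelay) values
    (1987, 'ATL', 20), (1987, 'ATL', 30), (1987, 'ATL', 5),
    (1987, 'ORD', 40), (1987, 'ORD', 0),
    (1988, 'ORD', 16), (1988, 'ORD', 25), (1988, 'ORD', null),
    (1988, 'ATL', 50), (1988, 'ATL', 10);

with oy as (
    select origin, year,
           -- all flights from this airport in this year
           count(*) as numflights,
           -- conditional aggregation: only flights delayed by more than 15 minutes
           count(*) filter (where depdelay > 15) as depdelay_count,
           -- window function ordered by an aggregate, restarted for each year
           row_number() over (
               partition by year
               order by count(*) filter (where depdelay > 15) desc
           ) as seqnum
    from ontime_sample
    group by origin, year
)
select year, origin, numflights, depdelay_count
from oy
where seqnum <= 1   -- top 1 per year to keep the sample small; use 10 on real data
order by year;

-- Result on the sample rows:
--  year | origin | numflights | depdelay_count
--  1987 | ATL    |          3 |              2
--  1988 | ORD    |          3 |              2
```

Because numflights is computed in the same grouped pass, the total number of flights per top airport comes out for every year at once, with no per-year where clause. Keep in mind that row_number() breaks ties arbitrarily; rank(), as used in the question, would keep all tied airports instead.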