Use of window functions and subqueries in Hive

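I have a table in which each row represents one order. I'm trying to write a query that returns, for each customer whose second order was placed in January 2017, all of that customer's 2017 orders from the second order onward.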

My initial code looks like this:

    SELECT
    order_date
    ,cust_id
    ,nth_booking
    ,total_bookings
    FROM (SELECT order_date
    ,cust_id  -- must be selected here so the outer query can reference it
    ,order_id
    ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
    ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
    FROM my.orders
    -- all of 2017, matching the output shown below
    WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31') t1

This gives the following output, which is fine so far:

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-01 |   123   |       1     |       4        |
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-09-27 |   123   |       4     |       4        |
| 2017-02-02 |   456   |       1     |       3        |
| 2017-11-16 |   456   |       2     |       3        |
| 2017-12-04 |   456   |       3     |       3        |
| 2017-01-17 |   678   |       1     |       5        |
| 2017-01-30 |   678   |       2     |       5        |
| 2017-02-31 |   678   |       3     |       5        |
| 2017-05-26 |   678   |       4     |       5        |
| 2017-09-18 |   678   |       5     |       5        |

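However, I only want the orders from the second order onward, and that second order must have been placed in January 2017. So I added some extra conditions, and the query now looks like this: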

    SELECT
    order_date
    ,cust_id
    ,nth_booking
    ,total_bookings
    FROM (SELECT order_date
    ,cust_id  -- must be selected here so the outer query can reference it
    ,order_id
    ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
    ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
    FROM my.orders
    WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31') t1
    WHERE
    nth_booking >= 2
    AND order_date BETWEEN '2017-01-01' AND '2017-01-31'

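This is clearly wrong, and looking at the results below I can see why: the order_date condition is applied to every row, so only January orders come back: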

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-01-30 |   678   |       2     |       5        |

What I want is more like the following, where the second order was placed in January 2017 but all subsequent orders are shown as well:

-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-02 |   123   |       2     |       4        |
| 2017-01-05 |   123   |       3     |       4        |
| 2017-09-27 |   123   |       4     |       4        |
| 2017-01-30 |   678   |       2     |       5        |
| 2017-02-31 |   678   |       3     |       5        |
| 2017-05-26 |   678   |       4     |       5        |
| 2017-09-18 |   678   |       5     |       5        |

How can I get this result?

Any guidance would be much appreciated. I hope I have provided enough detail to make this reproducible.

Thanks in advance.

Calculate a second_order_jan flag per cust_id and use it for filtering:

    select
          order_date
         ,cust_id
         ,nth_booking
         ,total_bookings
    from
    ( -- calculate second_order_jan_flag for each cust_id
    select cust_id,
           order_date,
           order_id,
           nth_booking,
           total_bookings,
           -- 1 on every row of a customer whose second order falls in January, otherwise NULL
           max(case when month(order_date) = 1 and nth_booking = 2 then 1 end) over (partition by cust_id) as second_order_jan_flag
    from
    (
    SELECT cust_id
        ,order_date
        ,order_id
        ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
        ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
        FROM my.orders
        -- keep the whole of 2017 here; a January-only filter would drop the later orders
        WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31'
    ) t1
    ) t2
    where second_order_jan_flag = 1
      and nth_booking >= 2 -- keep the second order and everything after it
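
One caveat about the numbering: with an ORDER BY and no explicit frame, COUNT(*) OVER (...) uses a RANGE frame, so two orders sharing the same order_date for one customer would get the same nth_booking. If that can happen in your data, ROW_NUMBER() may be the safer choice. Below is a minimal sketch of the same flag-and-filter idea written with CTEs. The sample_orders CTE is a hypothetical inline stand-in for my.orders holding a few rows from the question, and I'm assuming order_date is stored as a 'yyyy-MM-dd' string or a DATE and that your Hive version accepts UNION ALL inside a CTE.

```sql
-- Sketch only: sample_orders is a hypothetical stand-in for my.orders.
WITH sample_orders AS (
    SELECT 123 AS cust_id, '2017-01-01' AS order_date
    UNION ALL SELECT 123, '2017-01-02'
    UNION ALL SELECT 123, '2017-01-05'
    UNION ALL SELECT 123, '2017-09-27'
    UNION ALL SELECT 456, '2017-02-02'
    UNION ALL SELECT 456, '2017-11-16'
    UNION ALL SELECT 456, '2017-12-04'
),
numbered AS (
    -- Number each customer's 2017 orders; ROW_NUMBER() stays unique even on tied dates.
    SELECT cust_id,
           order_date,
           ROW_NUMBER() OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking,
           COUNT(*)     OVER (PARTITION BY cust_id)                     AS total_bookings
    FROM sample_orders
    WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31'
),
flagged AS (
    -- Spread a 1 across all rows of a customer whose second order is in January 2017.
    SELECT n.*,
           MAX(CASE WHEN nth_booking = 2
                     AND order_date BETWEEN '2017-01-01' AND '2017-01-31'
                    THEN 1 ELSE 0 END)
               OVER (PARTITION BY cust_id) AS second_order_jan_flag
    FROM numbered n
)
SELECT order_date, cust_id, nth_booking, total_bookings
FROM flagged
WHERE second_order_jan_flag = 1
  AND nth_booking >= 2          -- second order and everything after it
ORDER BY cust_id, order_date;
```

With these sample rows, customer 123 should come back with bookings 2 to 4, while customer 456 should drop out because its second order is in November. To run it against the real table, replace sample_orders with my.orders in the numbered CTE.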