Use of window functions and subqueries in Hive
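I have a table in which each row represents one order. I am trying to write a query that returns all customer orders in 2017 from the second order onward, where that second order was placed in January 2017.
The initial code looks like this: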
SELECT
     order_date
    ,cust_id
    ,nth_booking
    ,total_bookings
FROM (SELECT order_date
            ,cust_id
            ,order_id
            ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
            ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
      FROM my.orders
      WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31') t1
This gives the following output, which is fine so far:
-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-01 | 123 | 1 | 4 |
| 2017-01-02 | 123 | 2 | 4 |
| 2017-01-05 | 123 | 3 | 4 |
| 2017-09-27 | 123 | 4 | 4 |
| 2017-02-02 | 456 | 1 | 3 |
| 2017-11-16 | 456 | 2 | 3 |
| 2017-12-04 | 456 | 3 | 3 |
| 2017-01-17 | 678 | 1 | 5 |
| 2017-01-30 | 678 | 2 | 5 |
| 2017-02-31 | 678 | 3 | 5 |
| 2017-05-26 | 678 | 4 | 5 |
| 2017-09-18 | 678 | 5 | 5 |
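However, I only want the details of orders from the second order onward, and that second order must have taken place in January 2017. I therefore added some extra conditions, so the query now looks like this: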
SELECT
     order_date
    ,cust_id
    ,nth_booking
    ,total_bookings
FROM (SELECT order_date
            ,cust_id
            ,order_id
            ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
            ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
      FROM my.orders
      WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31') t1
WHERE
    nth_booking >= 2
    AND order_date BETWEEN '2017-01-01' AND '2017-01-31'
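This is clearly incorrect, and looking at the results below I can see why: the only rows returned are those that themselves satisfy the order_date condition: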
-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-02 | 123 | 2 | 4 |
| 2017-01-05 | 123 | 3 | 4 |
| 2017-01-30 | 678 | 2 | 5 |
What I want is something more like this, where the second order was placed in January 2017 and all subsequent orders are shown as well:
-------------------------------------------------------
| order_date | cust_id | nth_booking | total_bookings |
-------------------------------------------------------
| 2017-01-01 | 123 | 2 | 4 |
| 2017-03-05 | 123 | 3 | 4 |
| 2017-09-27 | 123 | 4 | 4 |
| 2017-01-30 | 678 | 2 | 5 |
| 2017-02-31 | 678 | 3 | 5 |
| 2017-05-26 | 678 | 4 | 5 |
| 2017-09-18 | 678 | 5 | 5 |
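How can I get this result?
Any guidance would be greatly appreciated, and I hope I have provided enough reproducible detail about my approach.
Thanks in advance.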
Calculate a second_order_jan flag for each cust_id and use it for filtering:
select
     order_date
    ,cust_id
    ,nth_booking
    ,total_bookings
from
( --calculate second_order_jan flag for the cust_id
  select cust_id,
         order_date,
         order_id,
         nth_booking,
         total_bookings,
         max(case when month(order_date) = 1 and nth_booking = 2 then 1 end) over (partition by cust_id) second_order_jan_flag
  from
  (
    SELECT cust_id
          ,order_date
          ,order_id
          ,COUNT (*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking
          ,COUNT (*) OVER (PARTITION BY cust_id) AS total_bookings
    FROM my.orders
    WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31' --whole of 2017, so later orders are kept
  ) t1
) t2 where second_order_jan_flag = 1 --customers whose second order was in January
   and nth_booking >= 2 --keep the second order and everything after it
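If no Hive cluster is at hand, the flag-and-filter logic above can be sanity-checked locally. The following is only a sketch under several assumptions: Python's built-in sqlite3 module stands in for Hive (window functions need SQLite 3.25 or later), the table is a hypothetical in-memory `orders` table loaded with the rows from the question's first output, `substr(order_date, 6, 2) = '01'` replaces Hive's `month(order_date) = 1` because dates are stored as text here, and `run_check` is a helper name invented for this example.

```python
import sqlite3

# Sample rows copied from the question's first output table: (cust_id, order_date).
# '2017-02-31' is kept exactly as posted; it is stored and compared as plain text.
ORDERS = [
    (123, "2017-01-01"), (123, "2017-01-02"), (123, "2017-01-05"), (123, "2017-09-27"),
    (456, "2017-02-02"), (456, "2017-11-16"), (456, "2017-12-04"),
    (678, "2017-01-17"), (678, "2017-01-30"), (678, "2017-02-31"),
    (678, "2017-05-26"), (678, "2017-09-18"),
]

# Same shape as the Hive query: running count, total count, then a per-customer flag.
QUERY = """
SELECT order_date, cust_id, nth_booking, total_bookings
FROM (
    SELECT t1.*,
           MAX(CASE WHEN substr(order_date, 6, 2) = '01' AND nth_booking = 2 THEN 1 END)
               OVER (PARTITION BY cust_id) AS second_order_jan_flag
    FROM (
        SELECT cust_id,
               order_date,
               COUNT(*) OVER (PARTITION BY cust_id ORDER BY order_date) AS nth_booking,
               COUNT(*) OVER (PARTITION BY cust_id) AS total_bookings
        FROM orders
        WHERE order_date BETWEEN '2017-01-01' AND '2017-12-31'
    ) t1
) t2
WHERE second_order_jan_flag = 1   -- customers whose second order fell in January
  AND nth_booking >= 2            -- the second order and everything after it
ORDER BY cust_id, order_date
"""


def run_check():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (cust_id INTEGER, order_date TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", ORDERS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_check():
        print(row)
```

On this sample, customer 456 drops out because their second order is in November, while customers 123 and 678 keep every order from the second one onward. One thing to keep in mind: with an ORDER BY and no explicit frame, COUNT(*) OVER gives rows that share the same order_date the same running count, so if a customer can place several orders on one day, ROW_NUMBER() may be the safer way to number them.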