How to use LAG & LEAD window functions for different groups of observations
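This question is about using the LAG and LEAD window functions with Spark SQL on Databricks, though I don't think it is tied to a particular SQL dialect.
I have an input table that lists visits by different customers (ID), with a flag marking "special visits":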
ID | date | special_visit
-------------------------------
A | 2018-01-01 | 0
A | 2018-02-01 | 1
A | 2018-03-01 | 1
B | 2018-01-02 | 0
B | 2018-02-02 | 0
B | 2018-03-02 | 1
What I want to create is the following table:
ID | date | special_visit | prev_visit | next_visit | prev_special_visit | next_special_visit
---------------------------------------------------------------------------------------------------
A | 2018-01-01 | 0 | NULL | 2018-02-01 | NULL | 2018-02-01
A | 2018-02-01 | 1 | 2018-01-01 | 2018-03-01 | NULL | 2018-03-01
A | 2018-03-01 | 1 | 2018-02-01 | NULL | 2018-02-01 | NULL
B | 2018-01-02 | 0 | NULL | 2018-02-02 | NULL | 2018-03-02
B | 2018-02-02 | 0 | 2018-01-02 | 2018-03-02 | NULL | 2018-03-02
B | 2018-03-02 | 1 | 2018-02-02 | NULL | NULL | NULL
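For each visit it shows the next/previous visit (every special visit also counts as a "normal" visit) and the next/previous special visit for the same ID.
What I get so far is the following output: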
ID | date | special_visit | prev_visit | next_visit | prev_special_visit | next_special_visit
---------------------------------------------------------------------------------------------------
A | 2018-01-01 | 0 | NULL | 2018-02-01 | NULL | NULL
A | 2018-02-01 | 1 | 2018-01-01 | 2018-03-01 | NULL | 2018-03-01
A | 2018-03-01 | 1 | 2018-02-01 | NULL | 2018-02-01 | NULL
B | 2018-01-02 | 0 | NULL | 2018-02-02 | NULL | NULL
B | 2018-02-02 | 0 | 2018-01-02 | 2018-03-02 | NULL | NULL
B | 2018-03-02 | 1 | 2018-02-02 | NULL | NULL | NULL
using this query:
WITH special_visits AS (
  SELECT ID
        ,date
        ,LAG(date) OVER (PARTITION BY ID ORDER BY date) AS prev_special_visit
        ,LEAD(date) OVER (PARTITION BY ID ORDER BY date) AS next_special_visit
  FROM input
  WHERE special_visit = 1)
SELECT ID
      ,date
      ,special_visit
      ,LAG(date) OVER (PARTITION BY ID ORDER BY date) AS prev_visit
      ,LEAD(date) OVER (PARTITION BY ID ORDER BY date) AS next_visit
      ,special_visits.prev_special_visit
      ,special_visits.next_special_visit
FROM input
LEFT JOIN special_visits USING(ID, date)
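The problem here is that I only get the previous/next special visit when the row itself is a special visit, because the join only matches those rows. I was hoping that some kind of filter inside the window function might work, like this: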
LAG(date) OVER (PARTITION BY ID ORDER BY date WHERE special_visit = 1) AS prev_special_visit
But unfortunately that does not work. Do you know how to create the desired output?
Thanks a lot!
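I don't think LEAD() and LAG() are the best choice for the special-visit columns. Instead, use a conditional MAX() / MIN() over a window frame that excludes the current row: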
SELECT ID, date,
LAG(date) OVER (PARTITION BY ID ORDER BY date) AS prev_visit,
LEAD(date) OVER (PARTITION BY ID ORDER BY date) AS next_visit,
MAX(CASE WHEN special_visit = 1 THEN date END) OVER (PARTITION BY ID ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) as prev_special_visit,
MIN(CASE WHEN special_visit = 1 THEN date END) OVER (PARTITION BY ID ORDER BY DATE ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING) as next_special_visit
FROM input
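To check the query above against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for Spark SQL here (an assumption made for local testing; it needs SQLite 3.25 or newer for window functions). Dates are stored as ISO-8601 strings so that MAX/MIN order them chronologically, and special_visit is added to the select list to match the desired table.

```python
import sqlite3

# Sample rows copied from the question: (ID, date, special_visit).
SAMPLE_ROWS = [
    ("A", "2018-01-01", 0),
    ("A", "2018-02-01", 1),
    ("A", "2018-03-01", 1),
    ("B", "2018-01-02", 0),
    ("B", "2018-02-02", 0),
    ("B", "2018-03-02", 1),
]

QUERY = """
SELECT ID, date, special_visit,
       LAG(date)  OVER (PARTITION BY ID ORDER BY date) AS prev_visit,
       LEAD(date) OVER (PARTITION BY ID ORDER BY date) AS next_visit,
       -- latest special visit strictly before the current row
       MAX(CASE WHEN special_visit = 1 THEN date END) OVER (
           PARTITION BY ID ORDER BY date
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
       ) AS prev_special_visit,
       -- earliest special visit strictly after the current row
       MIN(CASE WHEN special_visit = 1 THEN date END) OVER (
           PARTITION BY ID ORDER BY date
           ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
       ) AS next_special_visit
FROM input
ORDER BY ID, date
"""


def neighbor_visits(rows):
    """Load rows into an in-memory table named input and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE input (ID TEXT, date TEXT, special_visit INTEGER)"
    )
    conn.executemany("INSERT INTO input VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = neighbor_visits(SAMPLE_ROWS)
for row in result:
    print(row)  # SQL NULL shows up as None
```

On the sample data this reproduces the desired table from the question, for example next_special_visit = 2018-03-02 for both non-special visits of customer B.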
Assuming the dates are always increasing within each ID, you can do conditional min() and max() with a rows frame, like this:
select
    t.*,
    lag(date) over(partition by id order by date) prev_visit,
    lead(date) over(partition by id order by date) next_visit,
    max(case when special_visit = 1 then date end) over(
        partition by id
        order by date
        rows between unbounded preceding and 1 preceding
    ) prev_special_visit,
    min(case when special_visit = 1 then date end) over(
        partition by id
        order by date
        rows between 1 following and unbounded following
    ) next_special_visit
from input t
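The frame bounds matter: ending at 1 preceding (rather than at the current row) keeps a special visit from reporting itself as its own previous special visit. A small sketch of the difference, run through Python's sqlite3 module with SQLite standing in for Spark SQL (an assumption for local testing; it needs SQLite 3.25 or newer), using customer A's rows from the question and two illustrative column names, excluding_current and including_current:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input (ID TEXT, date TEXT, special_visit INTEGER)")
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [
        ("A", "2018-01-01", 0),
        ("A", "2018-02-01", 1),
        ("A", "2018-03-01", 1),
    ],
)

# Same conditional MAX twice; only the upper frame bound differs.
frame_demo = conn.execute("""
    SELECT date,
           MAX(CASE WHEN special_visit = 1 THEN date END) OVER (
               PARTITION BY ID ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS excluding_current,
           MAX(CASE WHEN special_visit = 1 THEN date END) OVER (
               PARTITION BY ID ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS including_current
    FROM input
    ORDER BY date
""").fetchall()
conn.close()

for row in frame_demo:
    print(row)  # SQL NULL shows up as None
```

With the frame that includes the current row, the visits on 2018-02-01 and 2018-03-01 return their own date, which is not what the question asks for.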