Is there a way to change this BigQuery self-join to use a window function?
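Suppose I have a BigQuery table "events" (in reality this is a slow subquery) that stores the count of events per day, by event type. There are many types of events, and most of them don't occur on most days, so a row exists only for day/event type combinations with a non-zero count.
I have a query that returns the count for each event type and day, together with the count for that event from N days earlier (N = 2 in this example), like this: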
WITH events AS (
SELECT DATE('2019-06-08') AS day, 'a' AS type, 1 AS count
UNION ALL SELECT '2019-06-09', 'a', 2
UNION ALL SELECT '2019-06-10', 'a', 3
UNION ALL SELECT '2019-06-07', 'b', 4
UNION ALL SELECT '2019-06-09', 'b', 5
)
SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_count
FROM events e1
LEFT JOIN events e2 ON e1.type = e2.type AND e1.day = DATE_ADD(e2.day, INTERVAL 2 DAY) -- LEFT JOIN, because the event may not have occurred at all 2 days ago
ORDER BY 1, 2
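The query is slow. BigQuery best practices recommend using window functions instead of self-joins. Is there a way to do that here? I could use the LAG function if there were a row for every day, but there isn't. Can I "pad" the data somehow? (There is no short list of possible event types. I could of course join to SELECT DISTINCT type FROM events, but that probably won't be faster than the self-join.)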
A brute-force method is:
select e.*,
       -- the row from 2 days ago, if it exists, is either 1 or 2 rows back
       (case when lag(day) over (partition by type order by day) = date_add(e.day, interval -2 day)
             then lag(count) over (partition by type order by day)
             when lag(day, 2) over (partition by type order by day) = date_add(e.day, interval -2 day)
             then lag(count, 2) over (partition by type order by day)
        end) as prev_day2_count
from events e;
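This works fine for a two-day lag. It gets more cumbersome for longer lags.
EDIT:
A more general form uses window frames. Unfortunately, a range frame has to be ordered by a number, so there is an additional step: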
select e.*,
       (case when min(day) over (partition by type order by diff range between 2 preceding and current row) = date_add(day, interval -2 day)
             then first_value(count) over (partition by type order by diff range between 2 preceding and current row)
        end) as prev_count
from (select e.*,
             date_diff(day, max(day) over (partition by type), day) as diff  -- day is a bad name for a column because it is a date part
      from events e
     ) e;
Duh! The case expression is not necessary:
select e.*,
       first_value(count) over (partition by type order by diff range between 2 preceding and 2 preceding) as prev_count
from (select e.*,
             date_diff(day, max(day) over (partition by type), day) as diff  -- day is a bad name for a column because it is a date part
      from events e
     ) e;
The following is for BigQuery Standard SQL:
#standardSQL
SELECT *, IFNULL(FIRST_VALUE(count) OVER (win), 0) prev_count
FROM `project.dataset.events`
WINDOW win AS (PARTITION BY type ORDER BY UNIX_DATE(day) RANGE BETWEEN 2 PRECEDING AND 2 PRECEDING)
Applied to the sample data from your question, the result is:
Row  day         type  count  prev_count
1    2019-06-08  a     1      0
2    2019-06-09  a     2      0
3    2019-06-10  a     3      1
4    2019-06-07  b     4      0
5    2019-06-09  b     5      4
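For anyone who wants to try the window-frame idea without a BigQuery project, here is a minimal sketch that runs the same pattern in SQLite through Python's sqlite3 module. This is an assumption-laden stand-in, not BigQuery itself: julianday() is used in place of UNIX_DATE(), the table is an in-memory copy of the sample data, and the helper name run() and the two query templates are made up for illustration. It needs a SQLite build with RANGE frame offsets (3.28 or later). It also runs the original self-join so the two results can be compared.

```python
import sqlite3

# Sample rows from the question: (day, type, count).
ROWS = [
    ("2019-06-08", "a", 1),
    ("2019-06-09", "a", 2),
    ("2019-06-10", "a", 3),
    ("2019-06-07", "b", 4),
    ("2019-06-09", "b", 5),
]

# Window-frame version. julianday() stands in for BigQuery's UNIX_DATE():
# it maps each date to a number that grows by exactly 1 per day, so a
# RANGE frame of "n PRECEDING AND n PRECEDING" selects only the row that
# is exactly n days earlier (or no row at all).
WINDOW_SQL = """
SELECT type, day, count,
       COALESCE(FIRST_VALUE(count) OVER win, 0) AS prev_count
FROM events
WINDOW win AS (
    PARTITION BY type
    ORDER BY CAST(julianday(day) AS INTEGER)
    RANGE BETWEEN {n} PRECEDING AND {n} PRECEDING
)
ORDER BY type, day
"""

# The self-join from the question, translated to SQLite date arithmetic.
SELF_JOIN_SQL = """
SELECT e1.type, e1.day, e1.count, COALESCE(e2.count, 0) AS prev_count
FROM events e1
LEFT JOIN events e2
  ON e1.type = e2.type AND e1.day = date(e2.day, '+{n} days')
ORDER BY e1.type, e1.day
"""


def run(sql_template, rows, n_days=2):
    """Load rows into an in-memory table and run the query for an n-day lag."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (day TEXT, type TEXT, count INTEGER)")
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        # The frame offset must be a constant, so it is formatted in as an int.
        return conn.execute(sql_template.format(n=int(n_days))).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run(WINDOW_SQL, ROWS):
        print(row)
    # Both formulations should agree on the sample data.
    print(run(WINDOW_SQL, ROWS) == run(SELF_JOIN_SQL, ROWS))
```

Changing n_days gives other lags without touching the query shape, which is the main advantage over the LAG-based brute-force version. Whether the window version is actually faster than the self-join on BigQuery is something to measure on the real data; this sketch only checks that the two return the same rows.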