Common Table Expressions (CTEs) in Databricks and Spark
I have a Spark DataFrame in Databricks, and I am trying to run some SQL queries that use common table expressions (CTEs). Here are the first 10 rows of the data:
+----------+----------+------+---+---+---------+-----------------+
| data_date|   user_id|region|sex|age|age_group|sum(duration_min)|
+----------+----------+------+---+---+---------+-----------------+
|2020-01-01|22600560aa|     1|  1| 28|        2|              0.0|
|2020-01-01|17148900ab|     6|  2| 60|        5|           1138.0|
|2020-01-01|21900230aa|     5|  1| 43|        4|              0.0|
|2020-01-01|35900050ac|     8|  1| 16|        1|            224.0|
|2020-01-01|22300280ad|     6|  2| 44|        4|              8.0|
|2020-01-02|19702160ac|     2|  2| 55|        5|              0.0|
|2020-02-02|17900020aa|     5|  2| 64|        5|            264.0|
|2020-02-02|16900120aa|     3|  1| 69|        6|              0.0|
|2020-02-02|11160900aa|     6|  2| 52|        5|              0.0|
|2020-03-02|16900290aa|     5|  1| 37|        3|              0.0|
+----------+----------+------+---+---+---------+-----------------+
Here I store each user's registration date in the regs CTE, then count the number of registrations per month. This block with a single CTE runs in Databricks without any problem:
%sql
WITH regs AS (
  SELECT
    user_id,
    MIN(data_date) AS reg_date
  FROM df2
  GROUP BY user_id
)
SELECT
  month(reg_date) AS reg_month,
  COUNT(DISTINCT user_id) AS users
FROM regs
GROUP BY reg_month
ORDER BY reg_month ASC;
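The single-CTE pattern can be sanity-checked outside Databricks with Python's built-in sqlite3 module. This is a stand-in for Spark SQL, not the author's environment: SQLite has no month() function, so CAST(strftime('%m', ...) AS INTEGER) substitutes, and only a subset of the sample rows is loaded.

```python
import sqlite3

# In-memory SQLite stand-in for the Spark table df2 (subset of the sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df2 (data_date TEXT, user_id TEXT)")
conn.executemany("INSERT INTO df2 VALUES (?, ?)", [
    ("2020-01-01", "22600560aa"), ("2020-01-01", "17148900ab"),
    ("2020-01-02", "19702160ac"), ("2020-02-02", "17900020aa"),
    ("2020-02-02", "16900120aa"), ("2020-03-02", "16900290aa"),
])

# Single CTE: first-seen date per user, then registrations per month.
# SQLite lacks month(); CAST(strftime('%m', ...) AS INTEGER) stands in.
rows = conn.execute("""
    WITH regs AS (
        SELECT user_id, MIN(data_date) AS reg_date
        FROM df2
        GROUP BY user_id
    )
    SELECT CAST(strftime('%m', reg_date) AS INTEGER) AS reg_month,
           COUNT(DISTINCT user_id) AS users
    FROM regs
    GROUP BY reg_month
    ORDER BY reg_month ASC
""").fetchall()
print(rows)  # [(1, 3), (2, 2), (3, 1)]
```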
However, when I add another CTE to the previous SQL query, it returns an error (I tested this block in SQL Server and it works fine there). I cannot figure out why it does not work in Spark on Databricks:
%sql
WITH regs AS (
  SELECT
    user_id,
    MIN(data_date) AS reg_date
  FROM df2
  GROUP BY user_id
),
regs_per_month AS (
  SELECT
    month(reg_date) AS reg_month,
    COUNT(DISTINCT user_id) AS users
  FROM regs
  GROUP BY reg_month
)
SELECT
  reg_month,
  users,
  LAG(users, 1) OVER (ORDER BY regs_per_month ASC) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC;
This is the error message:
Error in SQL statement: AnalysisException: cannot resolve '`regs_per_month`' given input columns: [regs_per_month.reg_month, regs_per_month.users]; line 20 pos 31;
'Sort ['reg_month ASC NULLS FIRST], true
You can chain common table expressions (CTEs) in Spark SQL simply by separating them with commas, e.g.
%sql
WITH regs AS (
  SELECT
    user_id,
    MIN(data_date) AS reg_date
  FROM df2
  GROUP BY user_id
),
regs_per_month AS (
  SELECT
    month(reg_date) AS reg_month,
    COUNT(DISTINCT user_id) AS users
  FROM regs
  GROUP BY reg_month
)
SELECT
  reg_month,
  users,
  LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC;
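The corrected comma-chained query can be verified end-to-end with Python's sqlite3 module as a hedged stand-in for Spark SQL (not the author's environment; SQLite lacks month(), so strftime substitutes, and window functions such as LAG require SQLite 3.25+, which ships with recent Python builds). Note that LAG's ORDER BY references the reg_month column, not the CTE name:

```python
import sqlite3

# In-memory SQLite stand-in for the Spark table df2 (subset of the sample rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df2 (data_date TEXT, user_id TEXT)")
conn.executemany("INSERT INTO df2 VALUES (?, ?)", [
    ("2020-01-01", "22600560aa"), ("2020-01-01", "17148900ab"),
    ("2020-01-02", "19702160ac"), ("2020-02-02", "17900020aa"),
    ("2020-02-02", "16900120aa"), ("2020-03-02", "16900290aa"),
])

# Two CTEs joined by a comma, then LAG ordered by the reg_month COLUMN
# (ordering by the CTE name regs_per_month is the bug in the question).
rows = conn.execute("""
    WITH regs AS (
        SELECT user_id, MIN(data_date) AS reg_date
        FROM df2
        GROUP BY user_id
    ),
    regs_per_month AS (
        SELECT CAST(strftime('%m', reg_date) AS INTEGER) AS reg_month,
               COUNT(DISTINCT user_id) AS users
        FROM regs
        GROUP BY reg_month
    )
    SELECT reg_month,
           users,
           LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
    FROM regs_per_month
    ORDER BY reg_month ASC
""").fetchall()
print(rows)  # [(1, 3, None), (2, 2, 3), (3, 1, 2)]
```

The first month's previous_users is NULL (Python None), since LAG has no preceding row to look back to.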
My results:
As mentioned, your LAG statement should reference the reg_month column rather than the regs_per_month CTE.

As an alternative to chaining CTEs, you can use nested WITH statements, e.g.
%sql
WITH regs_per_month AS (
  WITH regs AS (
    SELECT
      user_id,
      MIN(data_date) AS reg_date
    FROM df2
    GROUP BY user_id
  )
  SELECT
    month(reg_date) AS reg_month,
    COUNT(DISTINCT user_id) AS users
  FROM regs
  GROUP BY reg_month
)
SELECT
  reg_month,
  users,
  LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
FROM regs_per_month
ORDER BY reg_month ASC;
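The nested-WITH form can also be exercised in sqlite3, which (like Spark SQL) accepts a WITH clause inside a CTE's defining SELECT. Again this is a stand-in sketch, not Databricks: strftime replaces month(), a smaller sample is loaded, and SQLite 3.25+ is assumed for LAG.

```python
import sqlite3

# In-memory SQLite stand-in for the Spark table df2 (small sample).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df2 (data_date TEXT, user_id TEXT)")
conn.executemany("INSERT INTO df2 VALUES (?, ?)", [
    ("2020-01-01", "22600560aa"), ("2020-01-02", "19702160ac"),
    ("2020-02-02", "17900020aa"), ("2020-03-02", "16900290aa"),
])

# Nested form: the inner WITH regs lives inside the regs_per_month definition,
# so regs is only visible to that one CTE.
nested = conn.execute("""
    WITH regs_per_month AS (
        WITH regs AS (
            SELECT user_id, MIN(data_date) AS reg_date
            FROM df2
            GROUP BY user_id
        )
        SELECT CAST(strftime('%m', reg_date) AS INTEGER) AS reg_month,
               COUNT(DISTINCT user_id) AS users
        FROM regs
        GROUP BY reg_month
    )
    SELECT reg_month, users,
           LAG(users, 1) OVER (ORDER BY reg_month ASC) AS previous_users
    FROM regs_per_month
    ORDER BY reg_month ASC
""").fetchall()
print(nested)  # [(1, 2, None), (2, 1, 2), (3, 1, 1)]
```

Scoping is the practical difference from the chained form: with commas, later CTEs and the final SELECT can all reference regs; with nesting, regs is private to regs_per_month.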