SparkSQL "CASE WHEN THEN" 在 pyspark 中有两个 table 列
SparkSQL "CASE WHEN THEN" with two table columns in pyspark
I have two temporary tables, table_a and table_b, and I'm trying to get this query, with all of its conditions, to work correctly.
SELECT DISTINCT CASE WHEN a.id IS NULL THEN b.id ELSE a.id END id,
CASE WHEN a.num IS NULL THEN b.num ELSE a.num END num,
CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END testdate
FROM table_a a
FULL OUTER JOIN table_b b
ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
(CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END) <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
OR
(CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END) <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY
CASE WHEN a.id IS NULL THEN b.id ELSE a.id END,
CASE WHEN a.num IS NULL THEN b.num ELSE a.num END,
CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END
Running the above query against those two tables with SparkSQL produces the following error:
sqlq = <the sql from above>
df = sqlContext.sql(sqlq)
"AnalysisException: u"cannot resolve 'a.id
' given input columns: [id, num, testdate];"
Your error appears to be in the ORDER BY clause, because at that point Spark no longer has any notion of the tables a and b, only the names and aliases produced by the SELECT clause. In particular, SELECT DISTINCT projects the result before sorting, so ORDER BY can only reference the output columns, which is exactly what the error message lists: [id, num, testdate]. That makes sense, because you should really only sort the results by columns that actually exist in the result set.
SELECT DISTINCT (CASE WHEN a.id IS NULL THEN b.id ELSE a.id END) AS id,
(CASE WHEN a.num IS NULL THEN b.num ELSE a.num END) AS num,
(CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END) AS testdate
FROM table_a AS a
FULL OUTER JOIN table_b AS b
ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
(CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END) <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
OR
(CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END) <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY id, num, testdate
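Below is a minimal, self-contained sketch of running the corrected query, assuming the modern SparkSession entry point rather than the older sqlContext from your snippet; the sample rows, column values, and app name are made up for illustration.

# Sketch: register two tiny temp views and run the corrected query.
# Assumes a SparkSession; the sample data is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("case-when-demo").getOrCreate()

cols = ["id", "num", "testdate", "t_amt", "qty"]
spark.createDataFrame(
    [(1, 10, "2021-01-01", 100.0, 5), (2, 20, "2021-01-02", 50.0, 3)], cols
).createOrReplaceTempView("table_a")
spark.createDataFrame(
    [(1, 10, "2021-01-01", 999.0, 5), (3, 30, "2021-01-03", 75.0, 4)], cols
).createOrReplaceTempView("table_b")

sqlq = """
SELECT DISTINCT (CASE WHEN a.id IS NULL THEN b.id ELSE a.id END) AS id,
       (CASE WHEN a.num IS NULL THEN b.num ELSE a.num END) AS num,
       (CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END) AS testdate
FROM table_a AS a
FULL OUTER JOIN table_b AS b
  ON (a.id = b.id AND a.num = b.num AND a.testdate = b.testdate)
WHERE (CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END)
      <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
   OR (CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END)
      <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY id, num, testdate
"""

# Resolves now: ORDER BY references the output aliases, not a.* or b.*.
df = spark.sql(sqlq)
df.show()

As a side note, CASE WHEN x IS NULL THEN y ELSE x END is equivalent to COALESCE(x, y), which Spark SQL also supports, so the SELECT list and WHERE clause could be written more compactly.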