SparkSQL "CASE WHEN THEN" 在 pyspark 中有两个 table 列
SparkSQL "CASE WHEN THEN" with two table columns in pyspark
I have two temporary tables, table_a and table_b, and I'm trying to get this query, with all of its conditions, to work correctly.
SELECT DISTINCT CASE WHEN a.id IS NULL THEN b.id ELSE a.id END id,
CASE WHEN a.num IS NULL THEN b.num ELSE a.num END num,
CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END testdate
FROM table_a a
FULL OUTER JOIN table_b b
ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
(CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END) <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
OR
(CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END) <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY
CASE WHEN a.id IS NULL THEN b.id ELSE a.id END,
CASE WHEN a.num IS NULL THEN b.num ELSE a.num END,
CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END
Running the above query against those two tables with SparkSQL produces the following error:
sqlq = <the sql from above>
df = sqlContext.sql(sqlq)
"AnalysisException: u"cannot resolve 'a.id
' given input columns: [id, num, testdate];"
Your error appears to be in the ORDER BY clause, because at that point Spark no longer has any notion of the tables a and b, only the names and aliases produced by the SELECT clause. In particular, SELECT DISTINCT projects the result before sorting, so ORDER BY can only reference the output columns, which is exactly what the error message lists: [id, num, testdate]. That makes sense, because you should really only sort the results by columns that actually exist in the result set.
SELECT DISTINCT (CASE WHEN a.id IS NULL THEN b.id ELSE a.id END) AS id,
(CASE WHEN a.num IS NULL THEN b.num ELSE a.num END) AS num,
(CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END) AS testdate
FROM table_a AS a
FULL OUTER JOIN table_b AS b
ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
(CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END) <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
OR
(CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END) <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY id, num, testdate
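Below is a minimal, self-contained sketch of running the corrected query, assuming the modern SparkSession entry point rather than the older sqlContext from your snippet; the sample rows, column values, and app name are made up for illustration.

# Sketch: register two tiny temp views and run the corrected query.
# Assumes a SparkSession; the sample data is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("case-when-demo").getOrCreate()

cols = ["id", "num", "testdate", "t_amt", "qty"]
spark.createDataFrame(
    [(1, 10, "2021-01-01", 100.0, 5), (2, 20, "2021-01-02", 50.0, 3)], cols
).createOrReplaceTempView("table_a")
spark.createDataFrame(
    [(1, 10, "2021-01-01", 999.0, 5), (3, 30, "2021-01-03", 75.0, 4)], cols
).createOrReplaceTempView("table_b")

sqlq = """
SELECT DISTINCT (CASE WHEN a.id IS NULL THEN b.id ELSE a.id END) AS id,
       (CASE WHEN a.num IS NULL THEN b.num ELSE a.num END) AS num,
       (CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END) AS testdate
FROM table_a AS a
FULL OUTER JOIN table_b AS b
  ON (a.id = b.id AND a.num = b.num AND a.testdate = b.testdate)
WHERE (CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END)
      <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
   OR (CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END)
      <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY id, num, testdate
"""

# Resolves now: ORDER BY references the output aliases, not a.* or b.*.
df = spark.sql(sqlq)
df.show()

As a side note, CASE WHEN x IS NULL THEN y ELSE x END is equivalent to COALESCE(x, y), which Spark SQL also supports, so the SELECT list and WHERE clause could be written more compactly.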