Spark SQL CTE ignoring namespace in query
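When running Spark locally, either through spark-sql or through pyspark with spark.sql(...), if I define a CTE in a query and then reference that CTE with an incorrect namespace/database, the query runs fine (unexpected). When I run the same query in production (on Databricks), I get a "Table or view not found" error (expected).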
The unexpected passing behavior can be reproduced locally with spark-sql:
WITH myview AS (
    SELECT 1 AS column
)
SELECT
    *
FROM
    invalid_namespace.myview;
This returns "1" when I expect it to fail.
Can anyone help me fix this locally so that we can test properly before deploying?
Specific steps to reproduce from a terminal:
$ spark-sql
...
spark-sql> WITH some_new_cte AS (SELECT 1 AS column)
> SELECT * FROM namespace_does_not_exist.some_new_cte;
...
1
Time taken: 2.294 seconds, Fetched 1 row(s)
spark-sql>
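If you look at the query plan, the qualified relation is still unresolved after parsing, but the analyzer then resolves it to the CTE: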
== Parsed Logical Plan ==
CTE [myview]
:  +- SubqueryAlias `myview`
:     +- Project [1 AS column#0]
:        +- OneRowRelation
+- 'Project [*]
   +- 'UnresolvedRelation `invalid_namespace`.`myview`

== Analyzed Logical Plan ==
column: int
Project [column#0]
+- SubqueryAlias `myview`
   +- Project [1 AS column#0]
      +- OneRowRelation

== Optimized Logical Plan ==
Project [1 AS column#0]
+- OneRowRelation

== Physical Plan ==
*(1) Project [1 AS column#0]
+- Scan OneRowRelation[]
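Your query returns "1" because Spark sees that the view is defined in the same query, so it ignores the namespace qualifier. If it had actually tried to look up the namespace, the query would have failed.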
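To make that explanation concrete, here is a toy model in Python of the two resolution behaviors. It is only an illustration of what the plans above suggest, not Spark's actual analyzer code, and the function names are made up for this sketch.

```python
def resolve_buggy(name_parts, cte_names):
    """Mimics the behavior seen above: only the last name part is compared."""
    table = name_parts[-1]
    return table if table in cte_names else None


def resolve_strict(name_parts, cte_names):
    """Only an unqualified (single-part) name is allowed to refer to a CTE."""
    if len(name_parts) == 1 and name_parts[0] in cte_names:
        return name_parts[0]
    return None


ctes = {"myview"}
qualified = ["invalid_namespace", "myview"]

print(resolve_buggy(qualified, ctes))     # myview (namespace silently dropped)
print(resolve_strict(qualified, ctes))    # None   (falls through to a catalog lookup)
print(resolve_strict(["myview"], ctes))   # myview
```

With the strict rule, the qualified name is not treated as the CTE, so the lookup would go to the catalog and fail with "Table or view not found", which is what the question expects.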
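This appears to be a bug affecting Spark versions 2.4.0 through 2.4.5 (I did not check versions before 2.4.0). It appears to have been fixed in 2.4.6 and continues to work as expected in 3.0.0.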
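If your local tests need to know whether they are running on one of those versions, a small guard along these lines might help. It is a sketch based only on the versions tested in this answer; the helper names are hypothetical, and versions before 2.4.0 are reported as "not known" because they were not checked.

```python
import re


def parse_version(version):
    """Turn '2.4.5' (or '2.4.5-SNAPSHOT') into the tuple (2, 4, 5)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Spark version: {version!r}")
    return tuple(int(group) for group in match.groups())


def known_to_ignore_cte_namespace(version):
    """True only for the range this answer observed: 2.4.0 through 2.4.5."""
    return (2, 4, 0) <= parse_version(version) <= (2, 4, 5)


for v in ["2.4.0", "2.4.5", "2.4.6", "3.0.0"]:
    print(v, known_to_ignore_cte_namespace(v))
```

In pyspark the version string is available as spark.version, so a test suite could skip or fail early when the guard returns True.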
Unexpected success in 2.4.0:
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_262
Branch
Compiled by user on 2018-10-29T06:22:05Z
Revision
Url
Type --help for more information.
$ echo "WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;" | SPARK_CONF_DIR=spark_conf spark-sql
20/07/17 10:59:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark master: local[*], Application Id: local-1595008748636
spark-sql> WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;
1
Time taken: 1.822 seconds, Fetched 1 row(s)
spark-sql>
Tested but not shown: 2.4.1, 2.4.2, 2.4.3, 2.4.4 (all succeed unexpectedly)
Unexpected success in 2.4.5:
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_262
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
$ echo "WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;" | SPARK_CONF_DIR=spark_conf spark-sql
20/07/17 10:59:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark master: local[*], Application Id: local-1595008798737
spark-sql> WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;
1
Time taken: 2.155 seconds, Fetched 1 row(s)
spark-sql>
Expected failure in 2.4.6:
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.6
/_/
Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_262
Branch HEAD
Compiled by user holden on 2020-05-29T23:47:51Z
Revision 807e0a484d1de767d1f02bd8a622da6450bdf940
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
$ echo "WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;" | SPARK_CONF_DIR=spark_conf spark-sql
20/07/17 11:00:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark master: local[*], Application Id: local-1595008843321
spark-sql> WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;
20/07/17 11:00:44 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
20/07/17 11:00:44 WARN ObjectStore: Failed to get database bad_namespace, returning NoSuchObjectException
Error in query: Table or view not found: `bad_namespace`.`mycte`; line 1 pos 49;
'Project [*]
+- 'UnresolvedRelation `bad_namespace`.`mycte`
spark-sql>
Expected failure in 3.0.0:
$ spark-submit --version
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.0.0
/_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_262
Branch HEAD
Compiled by user ubuntu on 2020-06-06T11:32:25Z
Revision 3fdfce3120f307147244e5eaf46d61419a723d50
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
$ echo "WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;" | SPARK_CONF_DIR=spark_conf spark-sql
20/07/17 11:01:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/07/17 11:01:42 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/07/17 11:01:42 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/07/17 11:01:44 ERROR ObjectStore: Version information found in metastore differs 1.2.0 from expected schema version 2.3.0. Schema verififcation is disabled hive.metastore.schema.verification
20/07/17 11:01:44 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore georgeleslie-waksman@10.0.1.178
Spark master: local[*], Application Id: local-1595008901594
spark-sql> WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;
20/07/17 11:01:45 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
20/07/17 11:01:45 WARN ObjectStore: Failed to get database bad_namespace, returning NoSuchObjectException
Error in query: Table or view not found: bad_namespace.mycte; line 1 pos 49;
'Project [*]
+- 'UnresolvedRelation [bad_namespace, mycte]
spark-sql>
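If upgrading the local Spark is not an option, one possible stopgap is to lint queries before they are deployed. The sketch below flags table references that put a namespace in front of a name defined as a CTE in the same query. It is a rough regex-based check with hypothetical names, not a SQL parser: it does not handle comments, string literals, or three-part names.

```python
import re

# CTE definitions look like "WITH name AS (" or ", name AS (".
CTE_DEF = re.compile(r"(?:\bWITH|,)\s+`?(\w+)`?\s+AS\s*\(", re.IGNORECASE)
# Two-part table references after FROM or JOIN, e.g. "FROM db.table".
QUALIFIED_REF = re.compile(
    r"\b(?:FROM|JOIN)\s+`?(\w+)`?\s*\.\s*`?(\w+)`?", re.IGNORECASE
)


def qualified_cte_references(sql):
    """Return 'namespace.name' for every qualified reference to a CTE name."""
    cte_names = {name.lower() for name in CTE_DEF.findall(sql)}
    return [
        f"{namespace}.{table}"
        for namespace, table in QUALIFIED_REF.findall(sql)
        if table.lower() in cte_names
    ]


query = "WITH mycte AS (SELECT 1 AS column) SELECT * FROM bad_namespace.mycte;"
print(qualified_cte_references(query))  # ['bad_namespace.mycte']
```

An empty list means the check found nothing suspicious; anything else is a reference that 2.4.6 and 3.0.0 reject in the runs above.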