How to do a higher-order function transform with a subquery and a map lookup?
This is a follow-up to my previous question.
scala> val map1 = spark.sql("select map('s1', 'p1', 's2', 'p2', 's3', 'p3') as lookup")
map1: org.apache.spark.sql.DataFrame = [lookup: map<string,string>]
scala> val ds1 = spark.sql("select 'p1' as p, Array('s2','s3') as c")
ds1: org.apache.spark.sql.DataFrame = [p: string, c: array<string>]
scala> ds1.createOrReplaceTempView("ds1")
scala> map1.createOrReplaceTempView("map1")
scala> map1.show()
+--------------------+
| lookup|
+--------------------+
|[s1 -> p1, s2 -> ...|
+--------------------+
scala> ds1.show()
+---+--------+
| p| c|
+---+--------+
| p1|[s2, s3]|
+---+--------+
scala> map1.selectExpr("element_at(`lookup`, 's2')").first()
res50: org.apache.spark.sql.Row = [p2]
scala> spark.sql("select element_at(`lookup`, 's1') from map1").show()
+----------------------+
|element_at(lookup, s1)|
+----------------------+
| p1|
+----------------------+
So far so good. In the next two steps, I run into problems:
scala> ds1.selectExpr("p", "c", "transform(c, cs -> map1.selectExpr('element_at(`lookup`, cs)')) as cs").show()
20/09/28 19:44:59 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/09/28 19:44:59 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
20/09/28 19:45:03 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
20/09/28 19:45:03 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore root@10.1.21.76
20/09/28 19:45:03 WARN ObjectStore: Failed to get database map1, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Undefined function: 'selectExpr'. This function is neither a registered temporary function nor a permanent function registered in the database 'map1'.; line 1 pos 19
scala> spark.sql("""select p, c, transform(c, cs -> (select element_at(`lookup`, cs) from map1)) cc from ds1""").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'cs' given input columns: [map1.lookup]; line 1 pos 61;
'Project [p#329, c#330, transform(c#330, lambdafunction(scalar-subquery#713 [], lambda cs#715, false)) AS cc#714]
:  +- 'Project [unresolvedalias('element_at(lookup#327, 'cs), None)]
:     +- SubqueryAlias map1
:        +- Project [map(s1, p1, s2, p2, s3, p3) AS lookup#327]
:           +- OneRowRelation
+- SubqueryAlias ds1
   +- Project [p1 AS p#329, array(s2, s3) AS c#330]
      +- OneRowRelation
How can I fix these?
If map1 does not have too many rows, you can cross join it against the set of all values extracted from the arrays in column c.
spark.sql("select col as value, element_at(map1.lookup, col) as key +
"from (select explode(ds1.c) from ds1) as v cross join map1")
Result (assign the above to a DataFrame value and call .show on it):
+-----+---+
|value|key|
+-----+---+
| s2| p2|
| s3| p3|
+-----+---+
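Note that explode flattens the arrays, so the per-row grouping from ds1 is lost. If you need the looked-up values back as one array per row, here is a sketch (assuming you also carry p through the subquery as the grouping key; collect_list does not guarantee element order):

spark.sql("select p, collect_list(element_at(map1.lookup, col)) as cc " +
  "from (select p, explode(c) from ds1) as v cross join map1 " +
  "group by p").show()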
Alternatively, just add the table name to the from clause. The lambda variable cs cannot be resolved inside a scalar subquery, but once map1 appears in the from clause its lookup column is in scope for the transform.
spark.sql("""select p, c, transform(c, cs -> element_at(`lookup`, cs)) cc from ds1 a, map1 b""").show()
+---+--------+--------+
| p| c| cc|
+---+--------+--------+
| p1|[s2, s3]|[p2, p3]|
+---+--------+--------+
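The same idea also works through the DataFrame API instead of a temp view. A sketch (assuming Spark 2.4+ for transform and element_at): cross join the single-row map1 so that lookup becomes an ordinary column, then apply the higher-order function via selectExpr.

// cross join brings the one-row lookup map alongside every row of ds1
val result = ds1.crossJoin(map1)
  .selectExpr("p", "c", "transform(c, cs -> element_at(lookup, cs)) as cc")
result.show()

This produces the same p, c, cc result as the SQL version above.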