Seq.contains in SQL 在 Spark Dataframe 中

Question

我有以下数据结构：

id: int
records: Seq[String]
other: boolean

在json文件中，为了便于测试：

var data = sc.makeRDD(Seq[String](
   "{\"id\":1, \"records\": [\"one\", \"two\", \"three\"], \"other\": true}", 
   "{\"id\": 2, \"records\": [\"two\"], \"other\": true}", 
   "{\"id\": 3, \"records\": [\"one\"], \"other\": false }"))
sqlContext.jsonRDD(data).registerTempTable("temp")

我想过滤掉 records 字段中 one 且 other 等于 true 的记录，仅使用 SQL .

我可以通过 filter 来完成（见下文），但是可以只使用 SQL 来完成吗？

sqlContext
    .sql("select id, records from temp where other = true")
    .rdd.filter(t => t.getAs[Seq[String]]("records").contains("one"))
    .collect()

Answer 1

Spark SQL 支持绝大多数 Hive 功能，因此您可以使用 array_contains 来完成这项工作：

spark.sql("select id, records from temp where other = true and array_contains(records,'one')").show
# +---+-----------------+
# | id|          records|
# +---+-----------------+
# |  1|[one, two, three]|
# +---+-----------------+

注意： 在 spark 1.5 中，sqlContext.jsonRDD 已弃用，请使用以下内容代替：

sqlContext.read.format("json").json(data).registerTempTable("temp")

Seq.contains in SQL 在 Spark Dataframe 中

Seq.contains in SQL in Spark Dataframe

scala

dataframe

apache-spark

apache-spark-sql