SPARK DataFrame: alternative scala code for hive max(case) statement with group by

I have a DataFrame with four columns (id: Int, name: String, mobile: String, phone: String).

I need an alternative way to implement the logic of the following Hive query in Scala code.
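
For reference, a DataFrame with this shape could be built from a few sample rows like the sketch below; the rows and the spark session name are made up for illustration:

// assuming: val spark: SparkSession, and import spark.implicits._ for toDF
val temp1 = Seq(
  (1, "Mrs.", "1111111111", "2222222222"),
  (1, "Dr.",  "3333333333", "4444444444"),
  (2, "Mrs.", "5555555555", "6666666666")
).toDF("id", "name", "mobile", "phone")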

The Hive query is:

SELECT id AS member_id
,max(CASE WHEN name = 'Mrs.' THEN mobile ELSE NULL END) AS mobile
,max(CASE WHEN name = 'Dr.' THEN phone ELSE NULL END) AS phone
FROM temp1
GROUP BY id;

Thanks.

You can write it like this:

dataFrame.registerTempTable("temp1")
// pass the same SQL string as in the question
val result = sqlContext.sql("SELECT id AS member_id, max(CASE WHEN name = 'Mrs.' THEN mobile ELSE NULL END) AS mobile, max(CASE WHEN name = 'Dr.' THEN phone ELSE NULL END) AS phone FROM temp1 GROUP BY id")

Or, in Spark 2.0, it would be:

dataset.createTempView("temp1")
// same SQL string as above
val result = sparkSession.sql("SELECT id AS member_id, max(CASE WHEN name = 'Mrs.' THEN mobile ELSE NULL END) AS mobile, max(CASE WHEN name = 'Dr.' THEN phone ELSE NULL END) AS phone FROM temp1 GROUP BY id")
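
If the view name might already be registered, createOrReplaceTempView avoids the AnalysisException that createTempView throws on a name collision:

// overwrites any existing temp view with the same name
dataset.createOrReplaceTempView("temp1")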

Alternatively, you can use the Dataset API:

import org.apache.spark.sql.functions.{col, max, udf}
import sparkSession.implicits._  // for the $"colName" syntax

// Return the value only for the matching title; max() later ignores the nulls,
// which matches the Hive max(CASE ...) semantics
val mobileUDF = udf {
  (name: String, mobile: String) => if (name == "Mrs.") mobile else null
}
val phoneUDF = udf {
  (name: String, phone: String) => if (name == "Dr.") phone else null
}

dataset.withColumn("newMobile", mobileUDF($"name", $"mobile"))
    .withColumn("newPhone", phoneUDF($"name", $"phone"))
    .groupBy($"id")
    .agg(max(col("newMobile")), max(col("newPhone")))
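
Note that agg leaves auto-generated column names such as max(newMobile); to match the Hive query's output exactly, a variant along these lines could alias the aggregates and rename id (the aliases mirror the Hive query):

dataset.withColumn("newMobile", mobileUDF($"name", $"mobile"))
    .withColumn("newPhone", phoneUDF($"name", $"phone"))
    .groupBy($"id")
    .agg(max($"newMobile").alias("mobile"), max($"newPhone").alias("phone"))
    .withColumnRenamed("id", "member_id")  // match the member_id alias from the Hive query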

Or, avoiding UDFs altogether, try the built-in when/max functions:

import org.apache.spark.sql.functions.{max, when}
import sparkSession.implicits._  // for the 'colName symbol syntax

df.groupBy('id).agg(
  max(when('name === "Mrs.", 'mobile)).alias("mobile"),
  max(when('name === "Dr.", 'phone)).alias("phone")
)
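
Putting it together, here is a self-contained sketch of the when/max approach; the sample rows, app name, and local master are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, when}

val spark = SparkSession.builder().appName("max-case-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "Mrs.", "1111111111", "2222222222"),
  (1, "Dr.",  "3333333333", "4444444444"),
  (2, "Mrs.", "5555555555", "6666666666")
).toDF("id", "name", "mobile", "phone")

// when(...) without otherwise(...) yields null for non-matching rows,
// and max ignores nulls, so this reproduces the Hive max(CASE ...) logic
df.groupBy('id).agg(
  max(when('name === "Mrs.", 'mobile)).alias("mobile"),
  max(when('name === "Dr.", 'phone)).alias("phone")
).show()
// Expected contents (row order may vary):
// id=1: mobile=1111111111, phone=4444444444
// id=2: mobile=5555555555, phone=null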