SPARK DataFrame: alternative Scala code for Hive max(CASE) statement with GROUP BY
I have a DataFrame with four columns (id int, name String, mobile String, phone String).
I need an alternative way to implement the logic of a Hive query in Scala code.
The Hive query is:
SELECT id AS member_id
,max(CASE WHEN name = 'Mrs.' THEN mobile ELSE NULL END) AS mobile
,max(CASE WHEN name = 'Dr.' THEN phone ELSE NULL END) AS phone
from temp1
group by id;
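Thanks.
You can write:
dataFrame.registerTempTable("temp1")
// sqlText is a String holding the same SQL as in the question
val result = sqlContext.sql(sqlText)
Or, in Spark 2.0, it would be:
dataset.createTempView("temp1")
// sqlText is a String holding the same SQL as in the question
val result = sparkSession.sql(sqlText)
Alternatively, you can use the Dataset API: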
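import org.apache.spark.sql.functions.{col, max, udf}
// The $"..." column syntax also needs the session implicits (e.g. import sparkSession.implicits._)
// Keep mobile only for rows where name is "Mrs.", otherwise null
val mobileUDF = udf {
(name: String, mobile: String) => if (name == "Mrs.") mobile else null
}
// Keep phone only for rows where name is "Dr.", otherwise null (matches the query in the question)
val phoneUDF = udf {
(name: String, phone: String) => if (name == "Dr.") phone else null
}
dataset.withColumn("newMobile", mobileUDF($"name", $"mobile"))
.withColumn("newPhone", phoneUDF($"name", $"phone"))
.groupBy($"id")
.agg(max(col("newMobile")).as("mobile"), max(col("newPhone")).as("phone"))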
Try:
df.groupBy('id).agg(
max(when('name === "Mrs.", 'mobile)).alias("mobile"),
max(when('name === "Dr.", 'phone)).alias("phone")
)
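For anyone who wants to run the snippet above end to end, here is a minimal sketch. The object name MaxCaseExample, the helper contactsById, the local master setting and the sample rows are all made up for illustration and are not from the question. It relies on the fact that when(...) without otherwise(...) returns null for non-matching rows and that max skips nulls, which is what CASE ... ELSE NULL END does in the Hive query.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, when}

object MaxCaseExample {
  // Hypothetical helper: one row per id, with a conditional max per column.
  def contactsById(df: DataFrame): DataFrame =
    df.groupBy(col("id"))
      .agg(
        // when(...) without otherwise(...) yields null for non-matching rows,
        // and max ignores nulls, mirroring CASE ... ELSE NULL END.
        max(when(col("name") === "Mrs.", col("mobile"))).alias("mobile"),
        max(when(col("name") === "Dr.", col("phone"))).alias("phone")
      )
      // The Hive query exposes id as member_id.
      .withColumnRenamed("id", "member_id")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("max-case-example")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows with the four columns from the question.
    val df = Seq(
      (1, "Mrs.", "111", "222"),
      (1, "Dr.", "333", "444"),
      (2, "Mrs.", "555", "666")
    ).toDF("id", "name", "mobile", "phone")

    contactsById(df).orderBy("member_id").show()
    spark.stop()
  }
}
```

With these sample rows, member 1 ends up with mobile 111 and phone 444, while member 2 gets mobile 555 and a null phone because it has no "Dr." row. Compared with the UDF version, the built-in when/max functions stay visible to the query optimizer, so this form is usually the one to prefer.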