Spark conditional replacement of values

Question

For pandas I have this code snippet:

def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet

It conditionally replaces values in a data frame.

I am trying to port this functionality to Spark, but

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show

does not work for me:

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
warning: there was one feature warning; re-run with -feature for details
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;;

This happens even though df.printSchema reports String for both A and B.

What is wrong here?

Edit

A minimal example:

import java.sql.{ Date, Timestamp }
case class FooBar(foo:Date, bar:String)
val myDf = Seq(("2016-01-01","first"),("2016-01-02","second"),("2016-wrongFormat","noValidFormat"), ("2016-01-04","lastAssumingSameDate"))
         .toDF("foo","bar")
         .withColumn("foo", 'foo.cast("Date"))
         .as[FooBar]

myDf.printSchema
root
 |-- foo: date (nullable = true)
 |-- bar: string (nullable = true)


scala> myDf.show
+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
|      null|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show

and the expected output:

+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
| "noValue"|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

Edit 2

If conditions need to be chained,

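df
    .withColumn("A",
      when(
        // The null check targets A, the column being filled:
        // a row cannot satisfy both B === "x" and B isNull.
        (($"B" === "x") and ($"A" isNull)) or
        (($"B" === "y") and ($"A" isNull)), "replacement"))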

should work.

Answer 1

Mind the operator precedence. It should be:

myDf.withColumn("foo",
  when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue"))
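
One thing the parenthesized fix does not cover: `when` without `otherwise` yields null for every row that does not match, so the existing dates would be lost. Below is a minimal sketch of one way to get the expected output from the question. It assumes a local SparkSession; the object and method names (`ConditionalReplace`, `fillFoo`, `sample`) are hypothetical, and the sample data is built directly with `Option[Date]` rather than by casting strings. It uses `.isNull` with dot syntax and `&&`, which sidesteps the precedence problem altogether.

```scala
import java.sql.Date
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object ConditionalReplace {
  // Hypothetical helper: puts "noValue" into foo where bar matches and foo is null.
  // All other rows keep their date, cast to string so both branches share one type.
  def fillFoo(df: DataFrame): DataFrame =
    df.withColumn(
      "foo",
      when(col("bar") === "noValidFormat" && col("foo").isNull, "noValue")
        .otherwise(col("foo").cast("string")))

  // Sample rows mirroring the question's data; None stands for the unparsable date.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Option[Date], String)](
      (Some(Date.valueOf("2016-01-01")), "first"),
      (Some(Date.valueOf("2016-01-02")), "second"),
      (None, "noValidFormat"),
      (Some(Date.valueOf("2016-01-04")), "lastAssumingSameDate")
    ).toDF("foo", "bar")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("conditional-replace")
      .getOrCreate()
    fillFoo(sample(spark)).show()
    spark.stop()
  }
}
```

The resulting foo column is a string column, since a date column cannot hold the literal "noValue".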

The unparenthesized expression:

$"bar" === "noValidFormat" and $"foo" isNull

is evaluated as:

(($"bar" === "noValidFormat") and $"foo") isNull
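
The grouping can be reproduced without Spark. The sketch below uses a hypothetical stand-in class `Expr` (not Spark's `Column`) that only records how Scala grouped the calls. It relies on two Scala rules: an operator starting with a letter, such as `and`, has the lowest infix precedence, and a postfix operator binds more loosely than any infix operator. The postfix `isNull` is presumably also what triggers the feature warning shown in the question.

```scala
import scala.language.postfixOps

// Hypothetical stand-in for a column expression: it only records the grouping.
final case class Expr(repr: String) {
  def ===(other: Any): Expr = Expr(s"($repr = $other)")
  def and(other: Expr): Expr = Expr(s"($repr AND ${other.repr})")
  def isNull: Expr = Expr(s"($repr IS NULL)")
}

object PrecedenceDemo {
  val bar: Expr = Expr("bar")
  val foo: Expr = Expr("foo")

  // Postfix isNull is applied last, to the whole infix expression.
  val unparenthesized: Expr = (bar === "noValidFormat" and foo isNull)

  // Explicit parentheses, as in the corrected version.
  val parenthesized: Expr = (bar === "noValidFormat") and (foo isNull)

  // Dot syntax binds tightest, so no extra parentheses are needed.
  val dotted: Expr = bar === "noValidFormat" and foo.isNull

  def main(args: Array[String]): Unit = {
    println(unparenthesized.repr) // (((bar = noValidFormat) AND foo) IS NULL)
    println(parenthesized.repr)   // ((bar = noValidFormat) AND (foo IS NULL))
    println(dotted.repr)          // ((bar = noValidFormat) AND (foo IS NULL))
  }
}
```

The first result shows the same shape as the error message in the question: AND is applied to a boolean comparison and a plain string column before the null check happens.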


Tags: apache-spark, apache-spark-sql, spark-dataframe
