Spark conditional replacement of values
For pandas I have a code snippet like this:
def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet
which conditionally replaces values in a data frame.
I am trying to port this functionality to Spark, but
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
does not work for me:
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
warning: there was one feature warning; re-run with -feature for details
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;;
even though df.printSchema reports string for both A and B.
What is wrong here?
Edit
A minimal example:
import java.sql.{ Date, Timestamp }
case class FooBar(foo:Date, bar:String)
val myDf = Seq(("2016-01-01","first"),("2016-01-02","second"),("2016-wrongFormat","noValidFormat"), ("2016-01-04","lastAssumingSameDate"))
.toDF("foo","bar")
.withColumn("foo", 'foo.cast("Date"))
.as[FooBar]
myDf.printSchema
root
|-- foo: date (nullable = true)
|-- bar: string (nullable = true)
scala> myDf.show
+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
|      null|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show
and the expected output:
+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
| "noValue"|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
Edit 2
In case chaining of conditions is required,
df
.withColumn("A",
when(
(($"B" === "x") and ($"B" isNull)) or
(($"B" === "y") and ($"B" isNull)), "replacement"))
should work.
Mind the operator precedence. It should be:
myDf.withColumn("foo",
when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue"))
This:
$"bar" === "noValidFormat" and $"foo" isNull
is evaluated as:
(($"bar" === "noValidFormat") and $"foo") isNull
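To get all the way to the expected output, two more details probably matter. A `when` without `otherwise` yields null for every row that does not match, so the existing values have to be passed through explicitly. Also, foo is a date while the replacement is a string, so the sketch below casts foo to string; that cast is my assumption, not something the question specifies. Calling `.isNull` with a dot sidesteps the postfix-operator parse entirely (and is likely what the feature warning was about). The helper name simply mirrors the pandas function, and the `ConditionalReplace` object and local SparkSession are only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object ConditionalReplace {
  // Hypothetical port of the pandas helper from the question.
  // Returns a new DataFrame; Spark DataFrames are not modified in place.
  def setUnknownCatValueConditional(
      df: DataFrame,
      conditionCol: String,
      condition: String,
      colToSet: String,
      valueToSet: String = "KEINE"): DataFrame =
    df.withColumn(
      colToSet,
      // .isNull as a method call binds tighter than === and &&,
      // so no extra parentheses are needed here.
      when(col(conditionCol) === condition && col(colToSet).isNull, lit(valueToSet))
        // Assumption: cast to string so both branches have the same type.
        .otherwise(col(colToSet).cast("string")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conditional-replace")
      .getOrCreate()
    import spark.implicits._

    // Same data as the minimal example in the question.
    val myDf = Seq(
      ("2016-01-01", "first"),
      ("2016-01-02", "second"),
      ("2016-wrongFormat", "noValidFormat"),
      ("2016-01-04", "lastAssumingSameDate")
    ).toDF("foo", "bar").withColumn("foo", col("foo").cast("date"))

    setUnknownCatValueConditional(myDf, "bar", "noValidFormat", "foo", "noValue").show()

    spark.stop()
  }
}
```

Note that the resulting foo column is a string column, so any later date arithmetic would need a cast back to date.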