Scala Spark handles Double.NaN differently in DataFrame and Dataset

Question

In my tests, I am trying to convert DataFrames/Datasets to sets and compare them. For example:

actualResult.collect.toSet should be(expectedResult.collect.toSet)

I noticed a few facts about Double.NaN values.

In Scala, Double.NaN == Double.NaN returns false.
In Spark, NaN == NaN is true. (official doc)

But I can't figure out why DataFrame and Dataset behave differently.

import org.apache.spark.sql.SparkSession

object Main extends App {
  val spark = SparkSession.builder().appName("Example").master("local").getOrCreate()
  import spark.implicits._

  val dataSet = spark.createDataset(Seq(Book("book 1", Double.NaN)))

  // Compare Set(Book(book 1,NaN)) to itself
  println(dataSet.collect.toSet == dataSet.collect.toSet) //false, why?

  // Compare Set([book 1,NaN]) to itself
  println(dataSet.toDF().collect.toSet == dataSet.toDF().collect.toSet) //true, why?
}

case class Book (title: String, price: Double)

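Here are my questions. Any insight is appreciated.

How does this happen in the code? (Where is equals overridden? etc.)
Is there a reason behind this design? Is there a better paradigm for asserting on Datasets/DataFrames in tests?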

Answer 1

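I would like to share a few points on this topic.

When you call dataSet.collect.toSet, you collect the data as a Set[Book], so the comparison is between two sets of Book objects.

The equals method of the individual Book objects, which the compiler generates for the Book case class, is used for the comparison. That is why println(dataSet.collect.toSet == dataSet.collect.toSet) prints false: the price fields are compared as primitive doubles, and Double.NaN == Double.NaN returns false.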
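The case class behaviour can be reproduced without Spark. The following is a minimal sketch in plain Scala; the object name CaseClassNaN is made up for illustration, and Book mirrors the case class from the question.

```scala
// Same shape as the Book case class in the question.
case class Book(title: String, price: Double)

// Hypothetical demo object, not part of Spark or of the original code.
object CaseClassNaN {
  // Two distinct instances with identical field values.
  val a: Book = Book("book 1", Double.NaN)
  val b: Book = Book("book 1", Double.NaN)

  def main(args: Array[String]): Unit = {
    // The generated equals compares the primitive price fields with ==,
    // so two NaN prices make the instances unequal.
    println(a == b)           // false
    // Set equality relies on element equality, so it is false as well.
    println(Set(a) == Set(b)) // false
  }
}
```

This matches what the question observes for dataSet.collect.toSet, since each collect call produces fresh Book instances.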

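When you call dataSet.toDF().collect.toSet, you collect the data as a Set[Row].

When you call toDF, Spark converts the Book class to Row (i.e. it serializes Book and then deserializes it into a Row whose fields are Java types). During this process it also performs some conversions on the fields using RowEncoder.

In RowEncoder.scala, the following code is used to convert all primitive fields to Java types: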

def apply(schema: StructType): ExpressionEncoder[Row] = {
    val cls = classOf[Row]
    val inputObject = BoundReference(0, ObjectType(cls), nullable = true)
    val serializer = serializerFor(AssertNotNull(inputObject, Seq("top level row object")), schema)
    val deserializer = deserializerFor(schema)
    new ExpressionEncoder[Row](
      schema,
      flat = false,
      serializer.asInstanceOf[CreateNamedStruct].flatten,
      deserializer,
      ClassTag(cls))
  }

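If you look at the source code of the equals methods in Double.java and Float.java, comparing NaN with NaN returns true. That is why the comparison of Row objects returns true, and println(dataSet.toDF().collect.toSet == dataSet.toDF().collect.toSet) prints true.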

     * <li>If {@code d1} and {@code d2} both represent
     *     {@code Double.NaN}, then the {@code equals} method
     *     returns {@code true}, even though
     *     {@code Double.NaN==Double.NaN} has the value
     *     {@code false}.
     * <li>If {@code d1} represents {@code +0.0} while
     *     {@code d2} represents {@code -0.0}, or vice versa,
     *     the {@code equal} test has the value {@code false},
     *     even though {@code +0.0==-0.0} has the value {@code true}.
     * </ul>
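The two rules quoted from the Javadoc can be checked directly. This is a small sketch in plain Scala; the object name BoxedNaN is made up for illustration. It calls .equals explicitly, because Scala's == on boxed numbers compares them numerically rather than delegating to java.lang.Double.equals.

```scala
// Hypothetical demo object contrasting primitive == with java.lang.Double.equals.
object BoxedNaN {
  val nan1: java.lang.Double = java.lang.Double.valueOf(Double.NaN)
  val nan2: java.lang.Double = java.lang.Double.valueOf(Double.NaN)
  val posZero: java.lang.Double = java.lang.Double.valueOf(0.0)
  val negZero: java.lang.Double = java.lang.Double.valueOf(-0.0)

  def main(args: Array[String]): Unit = {
    println(Double.NaN == Double.NaN) // false: primitive comparison
    println(nan1.equals(nan2))        // true: boxed equals treats NaN as equal to NaN
    println(0.0 == -0.0)              // true: primitive comparison
    println(posZero.equals(negZero))  // false: boxed equals distinguishes the two zeros
  }
}
```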

Please excuse any grammatical errors.
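On the question of asserting in tests, one possible approach is to compare collected case class instances with an explicit NaN-aware predicate instead of relying on the generated equals. The sketch below is only an illustration under assumptions: sameBook and sameBooks are hypothetical helpers, not Spark or ScalaTest APIs, and the comparison ignores order and duplicates in the same way a Set comparison would.

```scala
// Same shape as the Book case class in the question.
case class Book(title: String, price: Double)

// Hypothetical helpers for tests; not part of Spark or ScalaTest.
object NaNAwareCompare {
  // java.lang.Double.compare returns 0 for two NaN values,
  // but it treats 0.0 and -0.0 as different.
  def sameBook(x: Book, y: Book): Boolean =
    x.title == y.title && java.lang.Double.compare(x.price, y.price) == 0

  // Order-insensitive check that ignores duplicates: every element of one
  // side must have a match on the other side.
  def sameBooks(xs: Seq[Book], ys: Seq[Book]): Boolean =
    xs.forall(x => ys.exists(y => sameBook(x, y))) &&
      ys.forall(y => xs.exists(x => sameBook(x, y)))

  def main(args: Array[String]): Unit = {
    val actual = Seq(Book("book 1", Double.NaN), Book("book 2", 9.5))
    val expected = Seq(Book("book 2", 9.5), Book("book 1", Double.NaN))
    println(sameBooks(actual, expected)) // true
  }
}
```

In a test, the collected arrays could be passed to such a helper, for example sameBooks(actualResult.collect.toSeq, expectedResult.collect.toSeq), assuming both are Dataset[Book].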
