Comparison of a `float` to `np.nan` in Spark Dataframe

Is this expected behavior? I wanted to file a Spark issue, but this seems like such basic functionality that it is hard to imagine a bug here. What am I missing?

Python

import numpy as np

>>> np.nan < 0.0
False

>>> np.nan > 0.0
False

PySpark

from pyspark.sql.functions import col

df = spark.createDataFrame([(np.nan, 0.0),(0.0, np.nan)])
df.show()
#+---+---+
#| _1| _2|
#+---+---+
#|NaN|0.0|
#|0.0|NaN|
#+---+---+

df.printSchema()
#root
# |-- _1: double (nullable = true)
# |-- _2: double (nullable = true)

df.select(col("_1") > col("_2")).show()
#+---------+
#|(_1 > _2)|
#+---------+
#|     true|
#|    false|
#+---------+

This is both expected and documented behavior. To quote the NaN Semantics section of the official Spark SQL Guide:

There is special handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.
  • In aggregations, all NaN values are grouped together.
  • NaN is treated as a normal value in join keys.
  • NaN values go last when in ascending order, larger than any other numeric value.
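
The aggregation and ordering rules are easy to verify directly. A minimal sketch, reusing the spark session from above (nan_df is just a throwaway name):

nan_df = spark.createDataFrame(
    [(float("nan"),), (float("nan"),), (1.0,), (0.0,)], ["x"]
)

# All NaN values fall into a single group:
nan_df.groupBy("x").count().show()

# NaN sorts after every other numeric value in ascending order:
nan_df.orderBy("x").show()

The first query should report a single NaN group with a count of 2, and the second should list the NaN rows last.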

As you can see, ordering is not the only difference from Python's NaN semantics. In particular, Spark considers NaN values equal to each other:

spark.sql("""
    WITH table AS (SELECT CAST('NaN' AS float) AS x, cast('NaN' AS float) AS y) 
    SELECT x = y, x != y FROM table
""").show()
+-------+-------------+
|(x = y)|(NOT (x = y))|
+-------+-------------+
|   true|        false|
+-------+-------------+

while plain Python

float("NaN") == float("NaN"), float("NaN") != float("NaN")
(False, True)

and NumPy

np.nan == np.nan, np.nan != np.nan
(False, True)

do not.
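
For completeness, the same equality semantics can be reproduced through the DataFrame API rather than SQL. A minimal sketch against the df defined above (the alias is only for readability):

df.select((col("_1") == col("_1")).alias("_1 = _1")).show()
#+-------+
#|_1 = _1|
#+-------+
#|   true|
#|   true|
#+-------+

Both rows come back true, even though the first row compares NaN to itself.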

You can check the eqNullSafe docstring for further examples.
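
As a rough illustration of what that docstring covers, here is a sketch assuming Spark 2.3+, where Column.eqNullSafe is available (df2 is just a throwaway name), which should produce something along these lines:

df2 = spark.createDataFrame([(float("nan"),), (None,), (1.0,)], ["v"])
df2.select(
    (col("v") == float("nan")).alias("v = NaN"),
    col("v").eqNullSafe(float("nan")).alias("v <=> NaN"),
    col("v").eqNullSafe(None).alias("v <=> NULL"),
).show()
#+-------+---------+----------+
#|v = NaN|v <=> NaN|v <=> NULL|
#+-------+---------+----------+
#|   true|     true|     false|
#|   null|    false|      true|
#|  false|    false|     false|
#+-------+---------+----------+

Note that = treats NaN as equal to NaN but still propagates NULL, while eqNullSafe (SQL's <=>) never returns NULL.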

So to get the desired result, you have to check for NaN explicitly:

from pyspark.sql.functions import col, isnan, when

df.select(
    when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2")).alias("_1 > _2")
).show()
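
Both rows of df contain a NaN, so each comparison now comes out false:

#+-------+
#|_1 > _2|
#+-------+
#|  false|
#|  false|
#+-------+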