java.lang.NumberFormatException: For input string: "nan" on select count(*) on a table

case class Varnish(ID: String, varnish_latency: Float)


val seq = sc.sequenceFile[LongWritable, BytesWritable](logfile_path)
val usableRDD = seq.map({case (_, v : BytesWritable) => Text.decode(v.getBytes)})
                   .map(_.split(" "))
                   .map(p => Varnish(p(11), p(8).toFloat))
                   .toDF()
usableRDD.registerTempTable("Varnish")
sqlContext.sql("SELECT * from Varnish LIMIT 5").collect().foreach(println) // works fine
val countResult = sqlContext.sql("SELECT COUNT(*) FROM Varnish").collect() // throws Err
val cnt2 = countResult.head.getLong(0)

16/01/23 02:56:18 sparkDriver-akka.actor.default-dispatcher-20 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/23 02:56:18 Thread-3 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 57 in stage 1.0 failed 4 times, most recent failure:
Lost task 57.3 in stage 1.0 (TID 89, 10.1.201.14): java.lang.NumberFormatException: For input string: "nan"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)

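The exception is fairly self-explanatory. Some of the values you pass contain the string nan, which is not interpreted as a valid Float representation. The LIMIT 5 query presumably succeeds only because it has to parse a handful of rows, whereas COUNT(*) runs the conversion on every record: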

scala> "nan".toFloat
java.lang.NumberFormatException: For input string: "nan"
...
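
To see which raw tokens are actually causing the failure, something like the following could be run over a sample of the latency column before changing the job. This is only a sketch in plain Scala: `unparseable` is a hypothetical helper name, and the input is assumed to be the already-split latency strings.

```scala
import scala.util.Try

// Hypothetical helper: counts each distinct token that String.toFloat rejects.
def unparseable(values: Seq[String]): Map[String, Int] =
  values
    .filter(v => Try(v.toFloat).isFailure)   // keep only the tokens that fail to parse
    .groupBy(identity)
    .map { case (token, hits) => token -> hits.size }
```

Note that "NaN" (with that exact casing) parses fine, so only the lowercase "nan" and other malformed tokens would show up in the result.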

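Unless the data comes from an already validated source (like an RDBMS or Parquet files), you should never blindly trust that it is correctly formatted. You can modify your code to properly handle this and other malformed entries using Options: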

import scala.util.Try

case class Varnish(ID: String, varnish_latency: Option[Float])

...
  .map(p => Varnish(p(11), Try(p(8).toFloat).toOption))
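
As a rough illustration of what the Option-based version yields for a single record, here is a self-contained sketch in plain Scala. `parseLine` is a hypothetical helper, and the field positions (11 and 8) are simply taken from the question's code.

```scala
import scala.util.Try

case class Varnish(ID: String, varnish_latency: Option[Float])

// Hypothetical helper; field positions 11 and 8 come from the question's code.
def parseLine(line: String): Varnish = {
  val p = line.split(" ")
  // A NumberFormatException from toFloat is caught by Try and becomes None.
  Varnish(p(11), Try(p(8).toFloat).toOption)
}
```

A line whose ninth field is "nan" ends up with `varnish_latency = None` instead of failing the whole task.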

Or drop the case class and handle it using SQL:

...
  .map(p => (p(11), p(8)))  // plain tuple; the latency stays a String until the cast
  .toDF("ID", "varnish_latency")
  .withColumn("varnish_latency", $"varnish_latency".cast("double"))

Or pre-validate the input, then call .toFloat and drop the malformed entries.
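
The pre-validation route could look roughly like this. It is a sketch in plain Scala, not the exact job: `isValidLatency` and `cleanRows` are hypothetical names, and the regular expression encodes an assumption that a valid latency is a plain decimal number (so exponent notation such as "1e3" would be rejected too).

```scala
// Assumption: a valid latency is a plain decimal such as "12", "0.25" or "-3.5".
val Decimal = """[+-]?(\d+(\.\d*)?|\.\d+)"""

def isValidLatency(s: String): Boolean = s.matches(Decimal)

// Keeps only rows that have enough fields and a well-formed latency,
// then converts; toFloat cannot throw on anything that passed the check.
def cleanRows(rows: Seq[Array[String]]): Seq[(String, Float)] =
  rows
    .filter(p => p.length > 11 && isValidLatency(p(8)))
    .map(p => (p(11), p(8).toFloat))
```

The same filter and map functions should be usable on the RDD of split lines before calling toDF.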

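The first two options convert malformed values to nulls (None becomes null in the DataFrame). Because this is semantically imprecise (the original value was not-a-number, not a missing value) and loses information, you may prefer to handle the "nan" case explicitly. This can be done, for example, by replacing "nan" with "NaN" (the correct representation) before calling toFloat, or by pattern matching: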

p(8) match {
  case "nan" => Float.NaN
  case s => s.toFloat
}
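
If you want to keep not-a-number and missing values apart while still tolerating other garbage, the two ideas can be combined. A minimal sketch, assuming any casing of "nan" should map to Float.NaN; `parseLatency` is a hypothetical helper name:

```scala
import scala.util.Try

// Hypothetical helper: any casing of "nan" becomes Float.NaN, other values
// are parsed normally, and only genuinely malformed input yields None.
def parseLatency(s: String): Option[Float] =
  if (s.trim.equalsIgnoreCase("nan")) Some(Float.NaN)
  else Try(s.trim.toFloat).toOption
```

Keep in mind that NaN never compares equal to itself, so use isNaN rather than == when checking the result.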