Why does Spark on YARN in cluster mode fail with "Exception in thread "Driver" java.lang.NullPointerException"?

I use emr-5.4.0 with Spark 2.1.0. I understand what a NullPointerException is; this question is about why it is thrown in this particular case.

I can't really figure out why the NullPointerException shows up in the driver thread.

My job fails with this error:

18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a separate Thread
18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context initialization...
Exception in thread "Driver" java.lang.NullPointerException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon.run(ApplicationMaster.scala:637)
18/03/29 20:07:52 ERROR ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: SparkContext is null but app is still running!
    at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:415)
    at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main.apply$mcV$sp(ApplicationMaster.scala:766)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon.run(SparkHadoopUtil.scala:67)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon.run(SparkHadoopUtil.scala:66)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:764)
    at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/03/29 20:07:52 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.lang.IllegalStateException: SparkContext is null but app is still running!)
18/03/29 20:07:52 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: java.lang.IllegalStateException: SparkContext is null but app is still running!)
18/03/29 20:07:52 INFO ApplicationMaster: Deleting staging directory hdfs://<ip-address>.ec2.internal:8020/user/hadoop/.sparkStaging/application_1522348295743_0010
18/03/29 20:07:52 INFO ShutdownHookManager: Shutdown hook called
End of LogType:stderr

I submit the job like this:

spark-submit --deploy-mode cluster --master yarn --num-executors 40 --executor-cores 16 --executor-memory 100g --driver-cores 8 --driver-memory 100g --class <package.class_name> --jars <s3://s3_path/some_lib.jar> <s3://s3_path/class.jar>

My class looks like this:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

class MyClass {

  def main(args: Array[String]): Unit = {
    val c = new MyClass()
    c.process()
  }

  def process(): Unit = {
    val sparkConf = new SparkConf().setAppName("my-test")
    val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    import sparkSession.implicits._
    ....
  }

  ...
}

Change class MyClass to object MyClass and you're done.
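
Keeping the structure of the question's code, a minimal sketch of the fixed version could look like this (imports included so it stands on its own; the body of process() is elided just as in the question):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// An object's main gets a static forwarder method, which is what
// ApplicationMaster's reflective invoke(null, ...) call needs.
object MyClass {

  def main(args: Array[String]): Unit = {
    process()
  }

  def process(): Unit = {
    val sparkConf = new SparkConf().setAppName("my-test")
    val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    import sparkSession.implicits._
    // ... transformations elided ...
  }
}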

While we're at it, I'd also change class MyClass to object MyClass extends App and remove def main(args: Array[String]): Unit (extends App provides it).
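
That variant would look roughly as follows (again a sketch; extends App supplies main through delayed initialization, so the object's body becomes the program):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object MyClass extends App {
  // no explicit main needed: extends App provides it
  val sparkConf = new SparkConf().setAppName("my-test")
  val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
  import sparkSession.implicits._
  // ... transformations elided ...
}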

I reported an improvement for Spark 2.3.0 - [SPARK-23830] Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object - so this error can be reported nicely to end users.


Digging deeper into how Spark on YARN works, the following message is printed out when the ApplicationMaster of a Spark application starts the driver (which happens when you use --deploy-mode cluster --master yarn with spark-submit):

ApplicationMaster: Starting the user application in a separate Thread

Right after that INFO message you should see another one:

ApplicationMaster: Waiting for spark context initialization...

That's part of the driver initialization while the ApplicationMaster runs.

The reason for the exception Exception in thread "Driver" java.lang.NullPointerException is the following code:

val mainMethod = userClassLoader.loadClass(args.userClass)
  .getMethod("main", classOf[Array[String]])

My understanding is that, because main lives in a class rather than an object, mainMethod here refers to an instance method (there is no static main), so the following line, which invokes it with a null receiver, "triggers" the NullPointerException:

mainMethod.invoke(null, userArgs.toArray)
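
You can reproduce that NullPointerException without Spark at all. Here is a standalone sketch (InClass, InObject and ReflectionDemo are made-up names for illustration); put it in one file, compile it with scalac, and run ReflectionDemo:

import java.lang.reflect.Method

// main in a Scala class compiles to an instance method only
class InClass {
  def main(args: Array[String]): Unit = println("hello from a class")
}

// main in a Scala object also gets a static forwarder on the InObject class
object InObject {
  def main(args: Array[String]): Unit = println("hello from an object")
}

object ReflectionDemo {
  def main(args: Array[String]): Unit = {
    val argv = Array.empty[String]

    // static forwarder: a null receiver is fine
    val ok: Method = Class.forName("InObject").getMethod("main", classOf[Array[String]])
    ok.invoke(null, argv)   // prints "hello from an object"

    // instance method: a null receiver makes invoke throw NullPointerException,
    // exactly what ApplicationMaster hits with a class-based application
    val boom: Method = Class.forName("InClass").getMethod("main", classOf[Array[String]])
    boom.invoke(null, argv) // throws java.lang.NullPointerException
  }
}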

The thread is indeed named Driver (as in Exception in thread "Driver" java.lang.NullPointerException), which is set in these lines:

userThread.setContextClassLoader(userClassLoader)
userThread.setName("Driver")
userThread.start()

The line numbers are different because I used Spark 2.3.0 to reference the lines, while you use emr-5.4.0 with Spark 2.1.0.