Apache Spark EC2 作业不是 运行。设备上没有 space

Apache Spark EC2 job not running. No space left on device

我已经 运行在一个 20 节点集群上多次运行我的程序。突然间,每次我 运行 程序都会出现以下错误:

15/04/19 16:52:35 WARN scheduler.TaskSetManager: Lost task 35.0 in stage 9.0 (TID 384, ip-XXX.XXX.compute.internal): java.io.FileNotFoundException: /mnt/spark/spark-local-XXX-ebd3/18/shuffle_2_35_64 (No space left on device)
    java.io.FileOutputStream.open(Native Method)
    java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
    org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write.apply(HashShuffleWriter.scala:67)
    org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write.apply(HashShuffleWriter.scala:65)
    scala.collection.Iterator$class.foreach(Iterator.scala:727)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

检查 UI,它说节点上绝对没有任何内容。我有 运行 该程序可能有 15 次,只是突然之间才开始。为什么这会突然发生?我该如何解决?

"No space left on device" 是一个非常明显的例外:该节点没有 space 留在正在写入 spark 本地文件的挂载上:/mnt/spark/

解决方案:转到一个(或多个)节点并将其清理干净。 rm -rf 好吧。

如果作业在终止前中断,由于人工干预或故障,它们通常会留下临时数据。