Apache Spark Task Failure
Why do Apache Spark tasks fail? I thought that, thanks to the DAG, tasks could be recomputed even without caching. I am in fact caching, and I either get a FileNotFoundException or the following:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9238.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9238.0 (TID 17337, ip-XXX-XXX-XXX.compute.internal): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_299_piece0 of broadcast_299
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:930)
org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:155)
sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:160)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
This is strange, because I have run the same program on smaller instances and did not get the FileNotFoundException - no space left on device; instead I got the error above. Yet when I, say, double the instance size, it tells me there is no space left on the device after about an hour of work - the same program, with more memory, and it runs out of space! What gives?
As described in issue SPARK-751:
Right now on each machine, we create M * R temporary files for shuffle, where M = number of map tasks, R = number of reduce tasks. This can be pretty high when there are lots of mappers and reducers (e.g. 1k map * 1k reduce = 1 million files for a single shuffle). The high number can cripple the file system and significantly slow the system down. We should cut this number down to O(R) instead of O(M*R).
So if you do find that your disk is running out of inodes, you can try the following to fix the problem:
- Reduce the number of partitions (see coalesce with shuffle = false); a sketch is shown after this list.
- You can also try to bring the number of shuffle files down to O(R) by "consolidating files", since file systems behave differently.
- Sometimes you may simply find that you need your sysadmin to increase the number of inodes the FS supports.
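
To make the first two suggestions concrete, here is a minimal Scala sketch. It assumes an older Spark 1.x cluster with the hash-based shuffle manager (matching the Java-serializer stack trace above); the application name, paths, and partition counts are placeholders, not values taken from the question.

import org.apache.spark.{SparkConf, SparkContext}

object FewerShuffleFiles {
  def main(args: Array[String]): Unit = {
    // spark.shuffle.consolidateFiles asks the old hash-based shuffle to reuse
    // output files, cutting the file count toward O(R) instead of O(M * R).
    val conf = new SparkConf()
      .setAppName("fewer-shuffle-files")              // placeholder name
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)

    // Placeholder input path.
    val lines = sc.textFile("hdfs:///path/to/input")

    // coalesce with shuffle = false merges existing partitions without a full
    // shuffle, so fewer map tasks (M) feed the wide transformation below and
    // fewer shuffle files are written.
    val fewerPartitions = lines.coalesce(64, shuffle = false)

    val counts = fewerPartitions
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _, 32)   // also keep the reduce-side partition count modest

    counts.saveAsTextFile("hdfs:///path/to/output")   // placeholder output path
    sc.stop()
  }
}

Note that spark.shuffle.consolidateFiles only applies to the old hash-based shuffle manager and is not available in newer Spark releases, so treat it as an option for the 1.x versions this trace comes from. Before changing anything, it is also worth confirming on an executor host whether it is really inodes rather than bytes that are exhausted (for example with df -i).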