Spark executor lost because of GC overhead limit exceeded even though using 20 executors using 25GB each
This GC overhead limit error is driving me crazy. I have 20 executors with 25 GB each, and I simply don't see how it can run into GC overhead; the dataset isn't that large either. Once this GC error occurs on one executor, that executor is lost, and then the other executors are gradually lost as well with IOException, RPC client disassociated, shuffle not found, and so on.
I am new to Spark.
WARN scheduler.TaskSetManager: Lost task 7.0 in stage 363.0 (TID 3373, myhost.com): java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.types.UTF8String.toString(UTF8String.scala:150)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:120)
at org.apache.spark.sql.columnar.STRING$.actualSize(ColumnType.scala:312)
at org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.gatherCompressibilityStats(compressionSchemes.scala:224)
at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.gatherCompressibilityStats(CompressibleColumnBuilder.scala:72)
at org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:80)
at org.apache.spark.sql.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:87)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$$anon.next(InMemoryColumnarTableScan.scala:148)
at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$$anon.next(InMemoryColumnarTableScan.scala:124)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:277)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
GC overhead limit exceeded is thrown when the JVM spends more than 98% of its time on garbage collection while reclaiming less than 2% of the heap. This can happen when you use immutable data structures in Scala, because every transformation forces the JVM to create many new objects and discard the previous ones from the heap. So if that is your problem, try switching to some mutable data structures instead.
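As a toy illustration of that point (my own sketch, not part of the original answer): repeatedly concatenating immutable Strings allocates a brand-new String on every step, which is exactly the kind of allocation churn that drives GC time up, whereas a mutable StringBuilder grows a single internal buffer.

// Immutable style: acc + p builds a new String object on every iteration,
// so each intermediate result immediately becomes garbage.
def immutableConcat(parts: Seq[String]): String =
  parts.foldLeft("")((acc, p) => acc + p)

// Mutable style: one StringBuilder is reused, so far fewer objects are created.
def mutableConcat(parts: Seq[String]): String = {
  val sb = new StringBuilder
  parts.foreach(sb ++= _)
  sb.toString
}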
Please read this page http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning to learn how to tune your GC.
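If you want to see what the executors' GC is actually doing before changing anything, a minimal sketch like the one below can help. The GC-logging JVM flags are the ones the linked tuning guide mentions; the parallelism value is only an assumed placeholder, not a recommendation from this answer.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("gc-tuning-example")
  // Emit GC activity in each executor's stdout so you can see how much time
  // is spent collecting (flags suggested by the Spark tuning guide).
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  // More, smaller partitions mean each task holds less data in memory at
  // once; 200 is just an illustrative value.
  .set("spark.default.parallelism", "200")

val sc = new SparkContext(conf)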