Spark: out of memory when broadcasting objects
I am trying to broadcast a map that is not that big (about 70 MB when saved to HDFS as a text file), but I get an out-of-memory error. I tried increasing the driver memory to 11G and the executor memory to 11G, and I still get the same error. memory.fraction is set to 0.3, and not much data is cached either (less than 1G).
When the map is only around 2 MB, there is no problem. I wonder whether there is a size limit when broadcasting objects. How can I work around this with the bigger map? Thanks!
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:229)
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:194)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:165)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:143)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:648)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1006)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1327)
EDIT:
Adding more information based on the comments (the settings below are also consolidated into a sketch right after this list):
- I submit the compiled jar file with spark-submit in client mode. Spark 1.5.0
- spark.yarn.executor.memoryOverhead 600
- set("spark.kryoserializer.buffer.max", "256m")
- set("spark.speculation", "true")
- set("spark.storage.memoryFraction", "0.3")
- set("spark.driver.memory", "15G")
- set("spark.executor.memory", "11G")
- I tried set("spark.sql.tungsten.enabled", "false") and it doesn't help.
- The host has 60G of memory, of which about 30G is for Spark/YARN. I am not sure how big the heap of my job is, but there are not many other processes running at the same time. In particular, the map is only around 70 MB.
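For reference, here is a minimal sketch of how the settings above might be wired together in the driver; the app name is a placeholder, and (as the answer below explains) spark.driver.memory set this way is ignored in client mode:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("broadcast-oom-demo") // hypothetical app name
  .set("spark.kryoserializer.buffer.max", "256m")
  .set("spark.speculation", "true")
  .set("spark.storage.memoryFraction", "0.3")
  .set("spark.driver.memory", "15G")  // ignored in client mode; see the answer below
  .set("spark.executor.memory", "11G")
val sc = new SparkContext(conf)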
Some code related to the broadcast:
val mappingAllLocal: Map[String, Int] = mappingAll.rdd.map(r => (r.getAs[String](0), r.getAs[Int](1))).collectAsMap().toMap
// I can save the above mappingAll to HDFS, and it's around 70 MB
val mappingAllBrd = sc.broadcast(mappingAllLocal) // <-- this is where the out of memory happens
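For context, a sketch of how the broadcast handle would typically be consumed afterwards (keysRdd is a hypothetical RDD[String]): the closure captures only the small broadcast handle, and each executor fetches the 70 MB map once via .value rather than shipping it with every task.

val resolved = keysRdd.map { key =>
  // .value returns the shared map; the lookup falls back to -1 for unknown keys
  (key, mappingAllBrd.value.getOrElse(key, -1))
}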
You can try to increase the JVM heap size:
-Xmx2g : maximum heap size of 2 GB
-Xms2g : initial heap size of 2 GB (the default is 256 MB)
Using set("spark.driver.memory", "15G") has no effect in client mode, because by the time your SparkConf is applied the driver JVM has already started. You have to pass the command-line argument --conf spark.driver.memory=15G when submitting the application in order to increase the driver's heap size.
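For example, a sketch of such a submission (the main class, jar path, and YARN master are placeholders):

spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.memory=15G \
  /path/to/app.jar

Equivalently, --driver-memory 15G does the same thing: both are read by the spark-submit launcher before the driver JVM starts, which is why they work in client mode where SparkConf does not.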