Spark 2.0 memory fraction

I am using Spark 2.0, and my job starts by sorting the input data and storing its output on HDFS.

I was getting out-of-memory errors, and the fix was to increase the value of "spark.shuffle.memoryFraction" from 0.2 to 0.8, which solved the problem. However, the documentation says this parameter is deprecated.

As far as I understand, it has been replaced by "spark.memory.fraction". How should I set this parameter so that it accounts for both the sort and the storage on HDFS?
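
For context, here is a minimal sketch of how I pass these settings when building the session; the HDFS paths and the 0.8/0.3 values are placeholders, not settings I am recommending:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the unified memory settings are passed when building
// the session. Values and HDFS paths are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("sort-and-write")
  .config("spark.memory.fraction", "0.8")        // share of (heap - 300MB) usable by execution + storage
  .config("spark.memory.storageFraction", "0.3") // share of the above reserved for cached blocks
  .getOrCreate()

val input = spark.read.textFile("hdfs:///path/to/input") // hypothetical input path
input.sort("value")                                      // textFile yields a single "value" column
  .write.text("hdfs:///path/to/output")                  // hypothetical output path
```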

From the documentation:

Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

  • spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%)
    is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually
    large records.
  • spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. Otherwise, when much of this space is used for caching and execution, the tenured generation will be full, which causes the JVM to significantly increase time spent in garbage collection.
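
To make the arithmetic concrete, here is a back-of-the-envelope sketch assuming a 4 GB executor heap and the default fractions (illustrative numbers only):

```scala
// Illustrative arithmetic only; 4 GB is an assumed heap size.
val heapMB          = 4096L
val reservedMB      = 300L  // fixed reservation described in the docs above
val memoryFraction  = 0.6   // spark.memory.fraction default
val storageFraction = 0.5   // spark.memory.storageFraction default

val unifiedMB = ((heapMB - reservedMB) * memoryFraction).toLong // M ~= 2277 MB
val storageMB = (unifiedMB * storageFraction).toLong            // R ~= 1138 MB
println(s"M = $unifiedMB MB, R = $storageMB MB")
```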

Based on that, I would modify spark.memory.storageFraction.


As a side note, are you sure you understand how your job behaves?

Usually you would start fine-tuning a job with memoryOverhead, the number of cores (#cores), and so on, before moving on to the property you modified.
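
As an illustration, a hypothetical starting configuration for those knobs; all values are placeholders, and spark.yarn.executor.memoryOverhead is the YARN-specific name used in the Spark 2.x era:

```scala
import org.apache.spark.SparkConf

// Hypothetical starting values; the right numbers depend on your cluster and data.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")                // heap per executor
  .set("spark.executor.cores", "4")                  // concurrent tasks per executor
  .set("spark.yarn.executor.memoryOverhead", "1024") // off-heap headroom in MB (YARN deployments)
```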