Spark: executor memory exceeds physical limit

Question

My input dataset is about 150G. I am setting

--conf spark.cores.max=100 
--conf spark.executor.instances=20 
--conf spark.executor.memory=8G 
--conf spark.executor.cores=5 
--conf spark.driver.memory=4G

But, since the data is not evenly distributed across executors, I keep getting

Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used

Here are my questions:

1. Did I not set up enough memory in the first place? I think 20 * 8G > 150G, but it's hard to get a perfect distribution, so some executors will suffer.
2. I am thinking about repartitioning the input DataFrame, so how can I determine how many partitions to set? The higher the better, or?
3. The error says "9 GB physical memory used", but I only set 8G for executor memory. Where does the extra 1G come from?

Thanks!

Answer 1

When using YARN, there is another setting that figures into how big to make the YARN container request for your executors:

spark.yarn.executor.memoryOverhead

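It defaults to 0.1 * your executor memory setting. It defines how much extra overhead memory to ask for in addition to what you specify as your executor memory. Try increasing this number first.

Also, YARN won't give you a container of an arbitrary memory size. It will only return containers whose memory size is a multiple of its minimum allocation size, which is controlled by this setting: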

yarn.scheduler.minimum-allocation-mb

Setting that to a smaller number will reduce the risk of "overshooting" the amount you asked for.
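To make the sizing concrete, here is a rough sketch of the arithmetic described above. The function name is made up for illustration, and the 384 MB overhead floor and the 1024 MB minimum allocation are assumptions about typical defaults, so check your own cluster's values. It also assumes the scheduler rounds up to a multiple of the minimum allocation, as described above.

```python
import math

def yarn_container_mb(executor_memory_mb, overhead_factor=0.10,
                      min_overhead_mb=384, min_allocation_mb=1024):
    """Estimate (requested, granted) container memory in MB for one executor.

    overhead_factor, min_overhead_mb and min_allocation_mb are assumed
    defaults; substitute the values configured on your cluster.
    """
    # Overhead is a fraction of the executor memory, with a fixed floor.
    overhead_mb = max(int(executor_memory_mb * overhead_factor), min_overhead_mb)
    requested_mb = executor_memory_mb + overhead_mb
    # Assumption: YARN rounds the request up to a multiple of the minimum allocation.
    granted_mb = math.ceil(requested_mb / min_allocation_mb) * min_allocation_mb
    return requested_mb, granted_mb

requested, granted = yarn_container_mb(8 * 1024)
print(requested, granted)
```

With an 8G executor and these assumed values, the request is a bit under 9 GB and gets rounded up to a 9 GB container, which would match the limit in the error message.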

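I also typically set the key below to a value larger than my desired container size, to ensure that the Spark request controls how big my executors are, rather than YARN stomping on them. This bounds the largest container YARN will give out on a node.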

yarn.nodemanager.resource.memory-mb

Answer 2

The 9GB is composed of the 8GB of executor memory that you pass as a parameter, plus spark.yarn.executor.memoryOverhead, which defaults to .1 of the executor memory. So the total memory of the container is spark.executor.memory + (.1 * spark.executor.memory), which is 8GB + (.1 * 8GB) ≈ 9GB.

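You could run the entire process using a single executor, but this would take ages. To understand this you need to know the notion of partitions and tasks. The number of partitions is defined by your input and the actions. For example, if you read a 150gb csv from hdfs and your hdfs block size is 128mb, you will end up with 150 * 1024 / 128 = 1200 partitions, which maps directly to 1200 tasks in the Spark UI.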
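As a quick sanity check of that arithmetic, here is a small sketch. The helper name is hypothetical, and it assumes a splittable input where each HDFS block becomes one partition, which will not hold for every file format or reader setting.

```python
import math

def estimate_input_partitions(input_size_gb, block_size_mb=128):
    """Estimate the number of input partitions, assuming one per HDFS block."""
    input_size_mb = input_size_gb * 1024
    # A trailing partial block still becomes its own partition.
    return math.ceil(input_size_mb / block_size_mb)

print(estimate_input_partitions(150))       # 128 MB blocks
print(estimate_input_partitions(150, 256))  # 256 MB blocks
```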

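Every single task will be picked up by an executor. You never need to keep all 150gb in memory. For example, when you have a single executor, you obviously won't benefit from the parallel capabilities of Spark, but it will just start at the first task, process the data, save it back to the dfs, and then start working on the next task.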
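To see why the whole dataset never has to fit in memory at once, here is a back-of-the-envelope sketch using the settings from the question (20 executors with 5 cores each) and the 1200 tasks from the example above. The function names are made up for illustration, and the "in flight" figure only counts raw input size; the deserialized size in memory can be considerably larger.

```python
import math

def task_waves(num_tasks, executor_instances, executor_cores):
    """Return (concurrent task slots, number of waves needed to run all tasks)."""
    slots = executor_instances * executor_cores
    waves = math.ceil(num_tasks / slots)
    return slots, waves

def input_in_flight_mb(executor_cores, partition_size_mb=128):
    """Raw input one executor reads at a time: one partition per running task."""
    return executor_cores * partition_size_mb

print(task_waves(1200, 20, 5))   # settings from the question
print(task_waves(1200, 1, 1))    # a single executor with a single core
print(input_in_flight_mb(5))     # MB of raw input per executor at any moment
```

Under these assumptions each executor works on roughly 640 MB of raw input at a time, so evenly sized partitions alone would not explain hitting an 8G limit.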

What you should check:

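How big are the input partitions? Is the input file splittable at all? If a single executor has to load a massive amount of data into memory, it will run out of memory for sure.
What kind of actions are you performing? For example, if you do a join on a key with very low cardinality, you end up with massive partitions, because all the rows with a specific value end up in the same partition.
Are any very expensive or inefficient actions performed? Any cartesian product, etc.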
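The low-cardinality point can be illustrated without a cluster. The sketch below is a plain Python stand-in for hash partitioning (it is not Spark's actual hash function), and 200 partitions is used only because that is the usual default for spark.sql.shuffle.partitions; both are assumptions for illustration.

```python
from collections import Counter

def rows_per_partition(keys, num_partitions):
    """Count rows per partition when each row goes to hash(key) % num_partitions."""
    return Counter(hash(key) % num_partitions for key in keys)

num_rows = 30000
num_partitions = 200

# Join key with only 3 distinct values: all rows land in at most 3 partitions.
low_cardinality = rows_per_partition((i % 3 for i in range(num_rows)), num_partitions)
# Join key that is unique per row: rows spread over all partitions.
high_cardinality = rows_per_partition(range(num_rows), num_partitions)

print(len(low_cardinality), max(low_cardinality.values()))
print(len(high_cardinality), max(high_cardinality.values()))
```

With the low-cardinality key, adding more shuffle partitions does not help, since the extra partitions just stay empty while a few partitions hold everything.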

Hope this helps. Happy sparking!

apache-spark

spark-dataframe