What is spark.driver.maxResultSize?

The ref says:

Limit of total size of serialized results of all partitions for each Spark action (e.g. collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.

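What does this attribute do exactly? I mean, at first (since I am not battling with a job that fails due to out-of-memory errors) I thought I should increase it.

On second thought, this attribute seems to define the maximum size of the result that a worker can send to the driver, so leaving it at the default (1G) would be the best way to protect the driver.

But what happens in that case? The worker would have to send more messages, so the only overhead is that the job becomes slower?

If I understand correctly, assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize). If so, then increasing that attribute to protect my driver from being killed by YARN should be wrong.

But the question above still remains. I mean, if I set it to 1M (the minimum), will it be the most protective approach?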

assuming that a worker wants to send 4G of data to the driver, then having spark.driver.maxResultSize=1G, will cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize).

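No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from driver loss, nothing more.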
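To make the difference between "abort" and "split into messages" concrete, here is a minimal sketch in plain Python. It is only a model of the behaviour described above, not Spark's actual code: `collect_with_limit` and `ResultTooLargeError` are hypothetical names, and the partition sizes are made up. It assumes, following the documentation quoted in the question, that the limit applies to the total serialized size across all partitions and that 0 means unlimited.

```python
class ResultTooLargeError(Exception):
    """Hypothetical stand-in for the job abort that Spark performs."""


def collect_with_limit(partition_sizes, max_result_size):
    """Model the check: add up serialized partition results and abort as
    soon as the running total exceeds the limit (0 means unlimited)."""
    total = 0
    for index, size in enumerate(partition_sizes):
        total += size
        if max_result_size > 0 and total > max_result_size:
            raise ResultTooLargeError(
                f"total size of {index + 1} partition results ({total} bytes) "
                f"exceeds the limit ({max_result_size} bytes)"
            )
    return total


GIB = 1024 ** 3

# Four partitions of 1 GiB each, 4 GiB in total (illustrative numbers).
sizes = [GIB] * 4

try:
    collect_with_limit(sizes, max_result_size=1 * GIB)
    outcome = "completed"
except ResultTooLargeError:
    outcome = "aborted"

print(outcome)  # aborted
print(collect_with_limit(sizes, max_result_size=0) == 4 * GIB)  # True
```

With a 1 GiB limit the 4 GiB result is never delivered in four pieces; the job simply fails once the running total passes the limit.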

if I set it to 1M (the minimum), will it be the most protective approach?

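In a sense yes, but obviously it is not useful in practice. A good value should allow the application to proceed normally, but protect it from unexpected conditions.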
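One possible way to turn "a good value" into a number is sketched below. This is an illustrative heuristic, not a Spark recommendation: the function `suggest_max_result_size`, the `headroom` factor, and the `memory_fraction` cap are all assumptions made for this example. The idea is to allow the result size you expect plus some slack, while staying well below the driver's memory.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_max_result_size(expected_bytes, driver_memory_bytes,
                            headroom=2.0, memory_fraction=0.5):
    """Illustrative heuristic only: expected result size times a headroom
    factor, capped at a fraction of driver memory. Both factors are
    assumptions of this sketch, not Spark defaults."""
    minimum = MIB  # the documentation asks for at least 1M
    wanted = int(expected_bytes * headroom)
    cap = int(driver_memory_bytes * memory_fraction)
    return max(minimum, min(wanted, cap))


# Expecting roughly 300 MiB of collected results on a 4 GiB driver.
limit = suggest_max_result_size(300 * MIB, 4 * GIB)

# Format as a size string in mebibytes for the configuration property.
setting = f"spark.driver.maxResultSize={limit // MIB}m"
print(setting)  # spark.driver.maxResultSize=600m
```

A result much larger than expected then aborts the job instead of taking down the driver, while the normal workload still fits.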

configuration

communication

driver

distributed-computing

apache-spark