How to deal with "could not execute broadcast in 300 secs"?

I'm building a job in which one of the stages intermittently fails with the following error:

Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1

How should I deal with this error?

First, let's talk about what this error means.

From the official Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables):

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
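To make the quoted passage concrete, here is a minimal broadcast-variable sketch. It is written for spark-shell, where `sc` (the SparkContext) is predefined, and the `countryNames` lookup map is a made-up example:

```scala
// A small lookup map is cached once per executor instead of being
// shipped to the cluster with every task.
val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

// Tasks read the executor-local cached copy via .value.
val resolved = sc.parallelize(Seq("US", "DE", "US"))
  .map(code => countryNames.value.getOrElse(code, "unknown"))

resolved.collect().foreach(println)
```

(In a broadcast join, Spark does essentially this for you with the smaller table, which is why a slow or skewed build of that table can run into the 300-second timeout.)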

In my experience, broadcast timeouts usually happen when one of the input datasets is poorly partitioned. Rather than disabling broadcast, I recommend looking at your datasets' partitioning and making sure it is sensible.
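A sketch of what that inspection might look like in spark-shell (`spark` is predefined there; the paths and the join key `id` are hypothetical):

```scala
val small = spark.read.parquet("/data/small_table")   // hypothetical path
val large = spark.read.parquet("/data/large_table")   // hypothetical path

// Check how each side is currently split up.
println(s"small: ${small.rdd.getNumPartitions} partitions")
println(s"large: ${large.rdd.getNumPartitions} partitions")

// If the broadcast side is skewed into a few huge partitions,
// redistributing it evenly helps the broadcast build finish in time.
val smallEven = small.repartition(8)
val joined = large.join(smallEven, "id")              // hypothetical join key
```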

The rule of thumb I use is to take the dataset's size in MB, divide it by 100, and set the number of partitions to that. Since the HDFS block size is 128 MB, we'd like files to come out to roughly 128 MB each, but because splits are never perfect, dividing by a smaller number (100 rather than 128) yields a few extra partitions as a margin.
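In code, that rule of thumb might look like this (a sketch: `partitionsFor` is a made-up helper, and the size in MB has to come from your storage layer, e.g. `hdfs dfs -du -s` on the path):

```scala
// Aim for ~100 MB per partition, staying under the 128 MB block size.
def partitionsFor(sizeInMB: Long): Int =
  math.max(1, math.ceil(sizeInMB / 100.0).toInt)

// e.g. a 1024 MB dataset -> 11 partitions of roughly 93 MB each.
val medium = spark.read.parquet("/data/medium_table") // hypothetical path
  .repartition(partitionsFor(1024))
```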

The main point: very small datasets (roughly < 128 MB) should live in a single partition, because the per-partition network overhead is just too high otherwise!
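As a one-line sketch of that (the path is hypothetical):

```scala
// Collapse a tiny dataset into one partition so the broadcast
// doesn't pay network overhead for many near-empty partitions.
val tiny = spark.read.parquet("/data/tiny_table").coalesce(1)
```

Hope this helps!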