How to deal with "could not execute broadcast in 300 secs"?

I'm building a job in which one of the stages intermittently fails with the following error:

Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1

How should I deal with this error?

First, let's talk about what this error means.

From the official Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables):

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
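To make the quoted passage concrete, here is a minimal broadcast-variable sketch. It is written for spark-shell, where `sc` (the SparkContext) is predefined, and the `countryNames` lookup map is a made-up example:

```scala
// A small lookup map is cached once per executor instead of being
// shipped to the cluster with every task.
val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

// Tasks read the executor-local cached copy via .value.
val resolved = sc.parallelize(Seq("US", "DE", "US"))
  .map(code => countryNames.value.getOrElse(code, "unknown"))

resolved.collect().foreach(println)
```

(In a broadcast join, Spark does essentially this for you with the smaller table, which is why a slow or skewed build of that table can run into the 300-second timeout.)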

In my experience, broadcast timeouts usually happen when one of the input datasets is poorly partitioned. Rather than disabling broadcast, I recommend looking at your datasets' partitioning and making sure it is sensible.
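A sketch of what that inspection might look like in spark-shell (`spark` is predefined there; the paths and the join key `id` are hypothetical):

```scala
val small = spark.read.parquet("/data/small_table")   // hypothetical path
val large = spark.read.parquet("/data/large_table")   // hypothetical path

// Check how each side is currently split up.
println(s"small: ${small.rdd.getNumPartitions} partitions")
println(s"large: ${large.rdd.getNumPartitions} partitions")

// If the broadcast side is skewed into a few huge partitions,
// redistributing it evenly helps the broadcast build finish in time.
val smallEven = small.repartition(8)
val joined = large.join(smallEven, "id")              // hypothetical join key
```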

The rule of thumb I use is to take the dataset's size in MB, divide it by 100, and set the number of partitions to that. Since the HDFS block size is 128 MB, we'd like files to come out to roughly 128 MB each, but because splits are never perfect, dividing by a smaller number (100 rather than 128) yields a few extra partitions as a margin.
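In code, that rule of thumb might look like this (a sketch: `partitionsFor` is a made-up helper, and the size in MB has to come from your storage layer, e.g. `hdfs dfs -du -s` on the path):

```scala
// Aim for ~100 MB per partition, staying under the 128 MB block size.
def partitionsFor(sizeInMB: Long): Int =
  math.max(1, math.ceil(sizeInMB / 100.0).toInt)

// e.g. a 1024 MB dataset -> 11 partitions of roughly 93 MB each.
val medium = spark.read.parquet("/data/medium_table") // hypothetical path
  .repartition(partitionsFor(1024))
```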

The main point: very small datasets (roughly < 128 MB) should live in a single partition, because the per-partition network overhead is just too high otherwise!
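As a one-line sketch of that (the path is hypothetical):

```scala
// Collapse a tiny dataset into one partition so the broadcast
// doesn't pay network overhead for many near-empty partitions.
val tiny = spark.read.parquet("/data/tiny_table").coalesce(1)
```

Hope this helps!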