Flink 保存点超时
Flink Savepoint Timeout
我使用 Flink 1.11 版本,在保存点期间出现超时问题
我的保存点大小约为 4Gb ++
如何增加保存点超时?
谢谢
请参阅 flink 文档的 Enabling and Configuring Checkpointing 部分。
您可以通过
将保存点超时增加到 1 分钟
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);
我还建议增加检查点之间的最短时间,以确保流式应用程序通过
在检查点之间取得一定的进展
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
我遇到了同样的问题,您可以从我非常相似的日志中看到:
org.apache.flink.util.FlinkException: Triggering a savepoint for the job 63a70a46cf5bffda3ca0a1e791113122 failed.
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
at org.apache.flink.client.cli.CliFrontend.lambda$savepoint(CliFrontend.java:754)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
at org.apache.flink.client.cli.CliFrontend.lambda$main(CliFrontend.java:1132)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:771)
... 10 more
原因不是保存点超时,而是客户端通信超时。
对我来说,这发生在 EMR 上。在主节点上编辑 /etc/flink/conf.dist/flink-conf.yaml
以添加以下内容,将超时增加到 5 分钟,达到了目的:
akka.client.timeout: 300000
对于一些额外的颜色,我使用的保存点大小是从 4 个实例中提取的 160.3 GiB。
我使用 Flink 1.11 版本,在保存点期间出现超时问题
我的保存点大小约为 4Gb ++
如何增加保存点超时?
谢谢
请参阅 flink 文档的 Enabling and Configuring Checkpointing 部分。
您可以通过
将保存点超时增加到 1 分钟// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);
我还建议增加检查点之间的最短时间,以确保流式应用程序通过
在检查点之间取得一定的进展// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
我遇到了同样的问题,您可以从我非常相似的日志中看到:
org.apache.flink.util.FlinkException: Triggering a savepoint for the job 63a70a46cf5bffda3ca0a1e791113122 failed.
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:777)
at org.apache.flink.client.cli.CliFrontend.lambda$savepoint(CliFrontend.java:754)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1002)
at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:751)
at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1072)
at org.apache.flink.client.cli.CliFrontend.lambda$main(CliFrontend.java:1132)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:771)
... 10 more
原因不是保存点超时,而是客户端通信超时。
对我来说,这发生在 EMR 上。在主节点上编辑 /etc/flink/conf.dist/flink-conf.yaml
以添加以下内容,将超时增加到 5 分钟,达到了目的:
akka.client.timeout: 300000
对于一些额外的颜色,我使用的保存点大小是从 4 个实例中提取的 160.3 GiB。