Should flink job manager crash during zookeeper upgrade?

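I am trying to understand whether the Flink JobManager's behavior during a ZooKeeper upgrade is expected.

I am running Flink 1.11.2 in Kubernetes with ZooKeeper server 3.5.4-beta. When I upgrade ZooKeeper, it is down for about 20 seconds. During those 20 seconds I expected the Flink job to restart, or at most a few warnings in the logs. Instead, the whole Flink JVM crashes (and the pod then restarts).

I expected Flink to retry ZooKeeper requests internally, so I was surprised that it crashed. Is this expected, or is it a bug?

From the logs: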

org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:00.197 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:00.198 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:02.294 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:02.295 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:02.295 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:03.841 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server zdzk.servicexxx/192.168.190.92:2181
[09-Feb-2021 11:30:03.842 UTC] INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established to zdzk.servicexxx/192.168.190.92:2181, initiating session
[09-Feb-2021 11:30:03.842 UTC] WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session 0x3012b0057140004 for server zdzk.servicexxx/192.168.190.92:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_192]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.8.0_192]
    at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_192]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[?:1.8.0_192]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
[09-Feb-2021 11:30:04.175 UTC] ERROR org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl [] - Background operation retry gave up
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access0(CuratorFrameworkImpl.java:66) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.call(CuratorFrameworkImpl.java:346) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
[09-Feb-2021 11:30:04.176 UTC] ERROR org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Received error from LeaderRetrievalService.
org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderRetrievalService:Background operation retry gave up
    at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.unhandledError(ZooKeeperLeaderRetrievalService.java:208) [flink-dist_2.11-1.11.2.jar:1.11.2]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:713) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:709) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access0(CuratorFrameworkImpl.java:66) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.call(CuratorFrameworkImpl.java:346) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    ... 10 more
[09-Feb-2021 11:30:04.178 UTC] ERROR org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] - Leader Election Service encountered a fatal error.
org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderElectionService: Background operation retry gave up
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) [flink-dist_2.11-1.11.2.jar:1.11.2]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:713) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:709) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access0(CuratorFrameworkImpl.java:66) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.call(CuratorFrameworkImpl.java:346) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    ... 10 more
[09-Feb-2021 11:30:04.179 UTC] ERROR org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Received error from LeaderRetrievalService.
org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderRetrievalService:Background operation retry gave up
    at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.unhandledError(ZooKeeperLeaderRetrievalService.java:208) [flink-dist_2.11-1.11.2.jar:1.11.2]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:713) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:709) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access0(CuratorFrameworkImpl.java:66) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.call(CuratorFrameworkImpl.java:346) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    ... 10 more
[09-Feb-2021 11:30:04.180 UTC] ERROR org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Fatal error occurred in ResourceManager.
org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Received an error from the LeaderElectionService.
    at org.apache.flink.runtime.resourcemanager.ResourceManager.handleError(ResourceManager.java:1053) [flink-dist_2.11-1.11.2.jar:1.11.2]
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) [flink-dist_2.11-1.11.2.jar:1.11.2]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:713) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:709) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access0(CuratorFrameworkImpl.java:66) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.call(CuratorFrameworkImpl.java:346) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
Caused by: org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderElectionService: Background operation retry gave up
    ... 18 more
Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    ... 10 more
[09-Feb-2021 11:30:04.181 UTC] ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Received an error from the LeaderElectionService.
    at org.apache.flink.runtime.resourcemanager.ResourceManager.handleError(ResourceManager.java:1053) [flink-dist_2.11-1.11.2.jar:1.11.2]
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) [flink-dist_2.11-1.11.2.jar:1.11.2]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:713) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.apply(CuratorFrameworkImpl.java:709) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.run(ListenerContainer.java:100) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access0(CuratorFrameworkImpl.java:66) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.call(CuratorFrameworkImpl.java:346) [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access1(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_192]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
Caused by: org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderElectionService: Background operation retry gave up
    ... 18 more
Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]
    ... 10 more
[09-Feb-2021 11:30:04.196 UTC] INFO org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124

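If the ZooKeeper quorum is maintained during the upgrade, the Flink JobManager should not be affected. Otherwise, it is not surprising that the JobManager fails.

Normally you upgrade the ZooKeeper followers first, one at a time, and the leader last. Before shutting down the next node, verify that quorum has been re-established.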
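One way to do that verification between node restarts is to ask each server for its role with ZooKeeper's `srvr` four-letter command. The following is only a minimal sketch: the helper names (`parse_mode`, `query_mode`, `quorum_ok`) are made up for illustration, and it assumes `srvr` is allowed by the server's four-letter-word whitelist and that the client port is 2181.

```python
import socket


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line in `srvr` output, or None if absent."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None  # e.g. the server is up but not currently serving requests


def query_mode(host, port=2181, timeout=5.0):
    """Send `srvr` to one ZooKeeper server and return its reported mode."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))


def quorum_ok(modes, ensemble_size):
    """True if exactly one leader exists and a majority of voters is serving."""
    serving = [m for m in modes if m in ("leader", "follower")]
    return modes.count("leader") == 1 and len(serving) > ensemble_size // 2
```

In a rolling upgrade script you would call `query_mode` for every server (treating a connection error as `None`) and restart the next node only once `quorum_ok` returns `True`.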

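I got a reply from the Flink community: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Should-flink-job-manager-crash-during-zookeeper-upgrade-tt41393.html

What I needed to tune were the following two parameters, making sure that Flink waits longer than ZooKeeper takes to elect a new leader: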

https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#high-availability-zookeeper-client-max-retry-attempts

https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#high-availability-zookeeper-client-retry-wait
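For illustration, the two options named in those links would go into `flink-conf.yaml` roughly as below. The values are placeholders I picked for the example, not recommendations or the values from the mailing-list thread. Choose them so that the total retry window comfortably exceeds your observed ZooKeeper downtime (about 20 seconds in the question).

```yaml
# Illustrative values only; tune to your own ZooKeeper failover time.
# Number of times the ZooKeeper client retries before giving up.
high-availability.zookeeper.client.max-retry-attempts: 10
# Wait between retries, in milliseconds.
high-availability.zookeeper.client.retry-wait: 5000
```

The JobManager pods must be restarted for the new settings to take effect.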