If the ZooKeeper leader process is killed, should all the followers get exceptions and restart too?
I'm working on a project using ZooKeeper 3.4.6 and running some failure-mode tests against it. In doing so, I found what I believe is unexpected behavior.
Should the followers restart if the leader ZooKeeper process is killed?
Environment:
OS: Windows Server 2008 R2 (hosted in a Tanuki Java service wrapper)
Zookeeper: 3.4.6
Java JDK: 1.7.0.210
Test:
The test is to kill a ZooKeeper process and verify that the ensemble recovers.
If I kill a non-leader process, it restarts and rejoins the ensemble without affecting the other nodes.
If I kill the leader process, the leader and all of the followers restart. This does not seem right, because for a period of time clients cannot connect to any ZooKeeper node (I measure that window with the client probe sketched below).
I have tried both the TCP and UDP communication settings; both show the same behavior, although UDP recovers about twice as fast.
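To quantify the client-visible outage, I log connection-state transitions from a plain Java client while killing servers. This is only a sketch; the connect string is a placeholder standing in for the blanked-out addresses:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Prints timestamped connection-state events; the gap between a
// Disconnected and the next SyncConnected brackets the outage window.
public class OutageProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; real IPs are blanked out above.
        String connectString = "0.0.0.1:2181,0.0.0.2:2181,0.0.0.3:2181";
        CountDownLatch forever = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(connectString, 5000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println(System.currentTimeMillis()
                        + " state=" + event.getState());
            }
        });
        forever.await(); // keep the probe alive while killing servers
    }
}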
ZooKeeper settings
tickTime=2000
initLimit=5
syncLimit=2
minSessionTimeout=5000
maxSessionTimeout=120000
dataDir=C:\ProgramData\Saab OneView\ZooKeeper\zoo-data
clientPort=2181
leaderServes=yes
autopurge.purgeInterval=24
# IP addresses blanked out here
server.1=0.0.0.1:2888:3888
server.2=0.0.0.2:2888:3888
server.3=0.0.0.3:2888:3888
server.4=0.0.0.4:2888:3888
server.5=0.0.0.5:2888:3888
# This is for zookeeper->zookeeper communication
# I've tried both settings, UDP has faster recovery time
# 0 = UDP
# 3 = TCP (default)
electionAlg=3
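To tell whether a node really restarts or merely drops its role during the election, I also poll each server's four-letter srvr command and watch the Mode: line; while a peer is in leader election it answers "This ZooKeeper instance is not currently serving requests" instead. A minimal poller sketch, again with placeholder addresses:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

// Polls the four-letter "srvr" command on every node once per second
// and prints each node's Mode (leader/follower) or its election notice.
public class ModePoller {
    static String srvr(String host, int port) {
        try (Socket s = new Socket(host, port)) {
            OutputStream out = s.getOutputStream();
            out.write("srvr".getBytes());
            out.flush();
            InputStream in = s.getInputStream();
            byte[] buf = new byte[4096];
            StringBuilder sb = new StringBuilder();
            int n;
            while ((n = in.read(buf)) > 0) {
                sb.append(new String(buf, 0, n));
            }
            return sb.toString();
        } catch (Exception e) {
            return "unreachable: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder addresses matching the blanked-out config above.
        String[] hosts = {"0.0.0.1", "0.0.0.2", "0.0.0.3", "0.0.0.4", "0.0.0.5"};
        while (true) {
            for (String h : hosts) {
                for (String line : srvr(h, 2181).split("\n")) {
                    if (line.startsWith("Mode:")
                            || line.startsWith("This ZooKeeper")
                            || line.startsWith("unreachable")) {
                        System.out.println(h + " -> " + line.trim());
                    }
                }
            }
            Thread.sleep(1000);
        }
    }
}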
Sample follower exception causing shutdown
20160309 05:35:51.958Z 20160309 05:35:51.958 [myid:3] - WARN [RecvWorker:4:QuorumCnxManager$RecvWorker@780] - Connection broken for id 4, my id = 3, error =
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.DataInputStream.readInt(Unknown Source)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:765)
20160309 05:35:51.959Z 20160309 05:35:51.959 [myid:3] - WARN [RecvWorker:4:QuorumCnxManager$RecvWorker@783] - Interrupting SendWorker
20160309 05:35:51.959Z 20160309 05:35:51.959 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.DataInputStream.readInt(Unknown Source)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
20160309 05:35:51.960Z 20160309 05:35:51.960 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
Per ZOOKEEPER-3478, this is the expected behavior:
It is normal behaviour that all the followers shutdown during a leader election. Since there is no leader after a leader crash, the servers that used to be followers are not followers anymore. So the followers shutdown and go back to LOOKING state in order to find the new leader.
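This matches the stack trace above: QuorumPeer.run (QuorumPeer.java:790) calls Follower.shutdown right after the "Exception when following the leader", i.e. the QuorumPeer thread tears down its follower role and goes back to leader election inside the same JVM. The "restart" visible in the logs is a role change, not a process restart, so the service wrapper should not need to restart anything.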