Nodetool Rebuild: Stream Failing After Some Time

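I have added a new node to my existing single-node Cassandra cluster, which holds about 48 GB of data. Only one keyspace matters here, and its replication factor is '2' (I changed it after adding the new node). I am trying to run nodetool rebuild on the new node so that the data can be streamed to it from the seed node. After transferring 36 GB of data the stream ended and the node went down. So I repeated the process, but the stream keeps failing after transferring some data (12-25 GB). It ends with the following error: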


error: Error while rebuilding node: Stream failed
-- StackTrace --
java.lang.RuntimeException: Error while rebuilding node: Stream failed
    at org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1319)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
    at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
    at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
    at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
    at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
    at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1468)
    at javax.management.remote.rmi.RMIConnectionImpl.access0(RMIConnectionImpl.java:76)
    at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
    at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
    at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
    at sun.rmi.transport.Transport.run(Transport.java:200)
    at sun.rmi.transport.Transport.run(Transport.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

P.S. I have made sure that streaming_socket_timeout_in_ms is set to at least 24 hours. Please help.
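For reference, streaming_socket_timeout_in_ms is given in milliseconds, so 24 hours is 24 x 60 x 60 x 1000. A trivial conversion sketch (the `hours_to_ms` helper is hypothetical, used only to show the arithmetic):

```python
# Hypothetical helper: convert hours to the millisecond value that
# streaming_socket_timeout_in_ms in cassandra.yaml expects.
def hours_to_ms(hours: float) -> int:
    return int(hours * 60 * 60 * 1000)

timeout_ms = hours_to_ms(24)
print(f"streaming_socket_timeout_in_ms: {timeout_ms}")
# prints: streaming_socket_timeout_in_ms: 86400000
```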

Thanks.

Update: I ran nodetool rebuild keyspace_name instead of nodetool rebuild, and it again ended with this error:

WARN  [StreamReceiveTask:9] 2019-10-23 11:14:41,522 StreamResultFuture.java:214 - [Stream #b9b051b0-f580-11e9-92dd-9765711f899a] Stream failed
ERROR [RMI TCP Connection(12)-10.128.1.3] 2019-10-23 11:14:42,316 StorageService.java:1318 - Error while rebuilding node
org.apache.cassandra.streaming.StreamException: Stream failed
        at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:215) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:191) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:481) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:571) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:281) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_222]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_222]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_222]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_222]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_222]
INFO  [Service Thread] 2019-10-23 11:14:43,223 GCInspector.java:284 - ConcurrentMarkSweep GC in 310ms.  CMS Old Gen: 2391324840 -> 639245216; Code Cache: 38320192 -> 38627904; Compressed Class Space: 554$
ERROR [STREAM-IN-/10.128.1.1:7000] 2019-10-23 11:14:48,769 StreamSession.java:593 - [Stream #b9b051b0-f580-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.1
java.lang.RuntimeException: Outgoing stream handler has been closed
        at org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:143) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:655) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:523) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:317) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]

Update 2: I tried nodetool rebuild on the new node again. After transferring about 95% of the data, the stream failed again. Here are the logs from the streaming (source) node:

INFO  [STREAM-INIT-/10.128.1.3:56486] 2019-10-23 11:16:03,497 StreamResultFuture.java:116 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a ID#0] Creating new streaming plan for Rebuild
INFO  [STREAM-INIT-/10.128.1.3:56486] 2019-10-23 11:16:03,498 StreamResultFuture.java:123 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO  [STREAM-INIT-/10.128.1.3:56488] 2019-10-23 11:16:03,498 StreamResultFuture.java:123 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO  [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:16:03,600 StreamResultFuture.java:173 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 133 f$
INFO  [Service Thread] 2019-10-23 11:19:14,472 GCInspector.java:284 - ParNew GC in 517ms.  CMS Old Gen: 104131728 -> 121315352; Par Eden Space: 1342177280 -> 0; Par Survivor Space: 67963984 -> 61263088
ERROR [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:56:43,902 StreamSession.java:706 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Remote peer 10.128.1.3 failed stream session.
INFO  [IndexSummaryManager:1] 2019-10-23 11:58:32,284 IndexSummaryRedistribution.java:77 - Redistributing index summaries
INFO  [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:59:38,687 StreamResultFuture.java:187 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Session with /10.128.1.3 is complete
ERROR [STREAM-OUT-/10.128.1.3:56486] 2019-10-23 11:59:38,688 StreamSession.java:593 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
java.lang.RuntimeException: Transfer of file /var/lib/cassandra/data/thingsboard/ts_kv_cf-53b7bf3096ec11e99154356269723c5c/md-583-big-Data.db already completed or aborted (perhaps session failed?).
        at org.apache.cassandra.streaming.messages.OutgoingFileMessage.startTransfer(OutgoingFileMessage.java:119) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:49) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
WARN  [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:59:38,688 StreamResultFuture.java:214 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Stream failed
INFO  [STREAM-INIT-/10.128.1.3:56674] 2019-10-23 12:03:24,860 StreamResultFuture.java:116 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a ID#0] Creating new streaming plan for Rebuild
INFO  [STREAM-INIT-/10.128.1.3:56674] 2019-10-23 12:03:24,861 StreamResultFuture.java:123 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO  [STREAM-INIT-/10.128.1.3:56676] 2019-10-23 12:03:24,861 StreamResultFuture.java:123 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO  [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:03:24,950 StreamResultFuture.java:173 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 133 f$
INFO  [Service Thread] 2019-10-23 12:04:18,160 GCInspector.java:284 - ParNew GC in 307ms.  CMS Old Gen: 124972984 -> 125070416; Par Eden Space: 1342177280 -> 0; Par Survivor Space: 61042328 -> 82423296
INFO  [GossipStage:1] 2019-10-23 12:27:39,200 Gossiper.java:1026 - InetAddress /10.128.1.3 is now DOWN
INFO  [HANDSHAKE-/10.128.1.3] 2019-10-23 12:27:39,424 OutboundTcpConnection.java:561 - Handshaking version with /10.128.1.3
ERROR [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,107 StreamSession.java:593 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
java.net.SocketException: End-of-stream reached
        at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:71) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:311) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
INFO  [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,108 StreamResultFuture.java:187 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Session with /10.128.1.3 is complete
ERROR [STREAM-OUT-/10.128.1.3:56674] 2019-10-23 12:27:45,108 StreamSession.java:593 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe
        at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:145) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.compress.CompressedStreamWriter.lambda$write$0(CompressedStreamWriter.java:85) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.applyToChannel(BufferedDataOutputStreamPlus.java:350) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.compress.CompressedStreamWriter.write(CompressedStreamWriter.java:85) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:101) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:52) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.4.jar:3.11.4]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
Caused by: java.io.IOException: Broken pipe
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) ~[na:1.8.0_222]
        at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) ~[na:1.8.0_222]
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) ~[na:1.8.0_222]
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605) ~[na:1.8.0_222]
        at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:141) ~[apache-cassandra-3.11.4.jar:3.11.4]
        ... 10 common frames omitted
WARN  [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,108 StreamResultFuture.java:214 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Stream failed
INFO  [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:19,854 Gossiper.java:525 - Removing host: 0e8ad28d-6cc2-46df-8d3f-f346d464db40
INFO  [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:19,854 Gossiper.java:526 - Sleeping for 30000ms to ensure /10.128.1.3 does not change
INFO  [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:49,854 Gossiper.java:533 - Advertising removal for /10.128.1.3
INFO  [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,245 StreamResultFuture.java:90 - [Stream #aae08f50-f590-11e9-9934-850cf6bcace3] Executing streaming plan for Restore replica count
INFO  [MiscStage:1] 2019-10-23 12:28:50,247 StorageService.java:4459 - Received unexpected REPLICATION_FINISHED message from /10.128.1.1. Was this node recently a removal coordinator?
INFO  [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,248 StorageService.java:2584 - Removing tokens [-9135980046459212380, -9100471967410923634, -9097242662756219549, -8974765285872613713, -895$
INFO  [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,317 Gossiper.java:557 - Completing removal of /10.128.1.3
INFO  [HANDSHAKE-/10.128.1.3] 2019-10-23 12:31:35,019 OutboundTcpConnection.java:561 - Handshaking version with /10.128.1.3

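I have no idea why it keeps failing. Can anyone point me in the right direction? I have made sure there are no firewall issues, and I am not using SSL for inter-node communication.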

org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe

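This can have several causes (corrupted data, network problems, schema disagreement between the two nodes), but basically it means the connection was cut, which killed the in-progress stream.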

Most likely it is a network issue. If you have any network metrics, try using them to debug the connection.
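On the Cassandra side, one way to narrow this down is to line up streaming failures with gossip state changes in system.log. The sketch below is a hypothetical helper (the `scan` function, `Event` class and keyword list are not part of Cassandra); it assumes the default log layout seen in the question (`LEVEL [thread] date time File.java:line - message`). Fed the Update 2 excerpt, it shows gossip marking /10.128.1.3 DOWN at 12:27:39, about six seconds before the stream error on that session, which is consistent with a peer or connection problem rather than bad data.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed layout: "LEVEL  [thread] YYYY-MM-DD HH:MM:SS,mmm File.java:NNN - message"
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<src>\S+)\s+-\s+(?P<msg>.*)$"
)
STREAM_ID_RE = re.compile(r"Stream #([0-9a-f-]+)")
# Hypothetical keyword list covering the failure/down messages in the question.
KEYWORDS = ("Stream failed", "Streaming error occurred", "failed stream session", "is now DOWN")


@dataclass
class Event:
    ts: str
    level: str
    stream_id: Optional[str]
    msg: str


def scan(lines: Iterable[str]) -> List[Event]:
    """Return streaming-failure and gossip-DOWN events in log order."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # stack-trace lines and other non-header lines are skipped
        msg = m.group("msg")
        if any(k in msg for k in KEYWORDS):
            sid = STREAM_ID_RE.search(msg)
            events.append(Event(m.group("ts"), m.group("level"),
                                sid.group(1) if sid else None, msg))
    return events


SAMPLE = [
    "INFO  [GossipStage:1] 2019-10-23 12:27:39,200 Gossiper.java:1026 - InetAddress /10.128.1.3 is now DOWN",
    "ERROR [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,107 StreamSession.java:593 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3",
    "java.net.SocketException: End-of-stream reached",
    "WARN  [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,108 StreamResultFuture.java:214 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Stream failed",
]

for e in scan(SAMPLE):
    print(e.ts, e.level, e.stream_id, e.msg)
```

On a real node you would pass `open("/var/log/cassandra/system.log")` (path assumed; it depends on your install) instead of `SAMPLE`.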

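There are a few things you can do here to be clever about it, mainly by reducing the amount of streaming you need to do. You can do that as follows:

  1. Reduce the keyspace RF back to 1
  2. Add the node with auto_bootstrap: true in cassandra.yaml
  3. Increase the RF back to 2
  4. Repair the data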
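The four steps can be sketched as a list of commands. This hypothetical helper only prints the commands, it does not run them. It assumes the keyspace is `thingsboard` and the table is `ts_kv_cf` (both taken from the SSTable path in the Update 2 log), that the keyspace uses SimpleStrategy, and that the new node's data directories are empty before it bootstraps. Repairing one table at a time is one way to keep each repair session small.

```python
from typing import List, Sequence


def replication_clause(rf: int) -> str:
    # Assumption: SimpleStrategy. With NetworkTopologyStrategy the map would
    # name the datacenter instead of 'replication_factor'.
    return "{'class': 'SimpleStrategy', 'replication_factor': %d}" % rf


def migration_plan(keyspace: str, tables: Sequence[str]) -> List[str]:
    steps = [
        # 1. Drop RF to 1 so the joining node only takes over its own token ranges.
        f"cqlsh -e \"ALTER KEYSPACE {keyspace} WITH replication = {replication_clause(1)};\"",
        # 2. auto_bootstrap defaults to true when the key is absent from cassandra.yaml.
        "# cassandra.yaml on the new (empty) node: auto_bootstrap: true, then start Cassandra",
        "nodetool status   # wait until the new node shows UN (Up/Normal)",
        # 3. Raise RF back to 2.
        f"cqlsh -e \"ALTER KEYSPACE {keyspace} WITH replication = {replication_clause(2)};\"",
    ]
    # 4. Full repair, one table per session.
    steps += [f"nodetool repair --full {keyspace} {t}" for t in tables]
    return steps


for cmd in migration_plan("thingsboard", ["ts_kv_cf"]):
    print(cmd)
```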

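This gives the same end result, two nodes that each hold 100% of the data, but during the node bootstrap you only stream half of the data. The repair then runs in smaller sessions (smaller units of work) and restores any other missing data, bringing you back to 100%.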
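A back-of-the-envelope version of the "half the data" claim, assuming the data is spread evenly over the token ring (e.g. with vnodes) and ignoring compression and SSTable overhead: in a cluster of `nodes` nodes, each node replicates min(RF, nodes)/nodes of the data. The function name is hypothetical.

```python
def expected_stream_gb(total_gb: float, rf: int, nodes: int) -> float:
    # Share of the data one node replicates, assuming an even token/data spread.
    fraction = min(rf, nodes) / nodes
    return total_gb * fraction


total = 48.0  # data size from the question
print(expected_stream_gb(total, rf=2, nodes=2))  # rebuild at RF 2: 48.0
print(expected_stream_gb(total, rf=1, nodes=2))  # bootstrap at RF 1: 24.0
```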

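Side note: I would recommend that you start taking regular snapshots of your nodes, as there seem to be signs of poor health. Running Cassandra on a single node means you are not really protected against data loss; that is why C* is distributed, and why a replication_factor of 3 is recommended for most setups.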
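To make the RF 3 recommendation concrete: within a single datacenter, Cassandra's QUORUM is floor(RF / 2) + 1 replicas, so the number of replicas that can be down while QUORUM requests still succeed is RF - quorum. A small sketch, assuming the application reads and writes at QUORUM (the helper names are hypothetical):

```python
def quorum(rf: int) -> int:
    # Single-datacenter QUORUM: a majority of the replicas.
    return rf // 2 + 1


def tolerable_replica_failures(rf: int) -> int:
    # Replicas that can be down while QUORUM reads/writes still succeed.
    return rf - quorum(rf)


for rf in (1, 2, 3):
    print(rf, quorum(rf), tolerable_replica_failures(rf))
# prints:
# 1 1 0
# 2 2 0
# 3 2 1
```

With RF 2, losing either replica blocks QUORUM; RF 3 is the smallest value that survives one replica down.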