Cassandra nodes in same Datacenter give different query results/errors

I'm having trouble with a Cassandra cluster that spans multiple datacenters, with 3 nodes per datacenter, 2 of which in each datacenter act as seeds:

I have a keyspace X with a replication factor of 3, meaning 3 replicas in datacenter DC1 and 3 replicas in datacenter DC2 (KEYSPACE X WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3'} AND durable_writes = true;)

Now, what I did (and maybe I'm missing something here) was to cqlsh into each node of datacenter DC2 (say node2A, node2B and node2C) and run the following:

cqlsh node2N
CONSISTENCY ALL
select * from x.table;

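By setting the consistency to ALL, I understand that I must get a response from every replica, 3 in DC1 and 3 in DC2, for 6 responses in total. Instead, I got 3 different results, one per node: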

node2A: the query fails with Cannot achieve consistency level ALL info: {'required_replicas': 6, 'alive_replicas': 5, 'consistency': ALL}
node2B: the query succeeds and returns the table data
node2C: the query takes 1-2 minutes and then returns Coordinator node timed out waiting for replica nodes' responses. Operation timed out - received only 5 responses. info: {'received_responses': 5, 'required_responses': 6, 'consistency': ALL}
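
For reference, the replica counts in those messages follow directly from the replication settings of the keyspace. The sketch below is plain arithmetic, not a call into any Cassandra API; the function name required_replicas is hypothetical and only a handful of consistency levels are covered.

```python
def required_replicas(consistency, replication, local_dc=None):
    """Return how many replica responses a consistency level needs.

    `replication` maps datacenter name -> replication factor, as in the
    NetworkTopologyStrategy options of the keyspace.
    """
    total = sum(replication.values())
    if consistency == "ONE":
        return 1
    if consistency == "LOCAL_QUORUM":
        # Quorum of the replicas in the coordinator's own datacenter.
        return replication[local_dc] // 2 + 1
    if consistency == "QUORUM":
        # Quorum of all replicas across every datacenter.
        return total // 2 + 1
    if consistency == "ALL":
        return total
    raise ValueError(f"unsupported consistency level: {consistency}")


# Replication settings taken from the keyspace definition in the question.
replication = {"DC1": 3, "DC2": 3}
for level in ("ONE", "LOCAL_QUORUM", "QUORUM", "ALL"):
    print(level, required_replicas(level, replication, local_dc="DC2"))
```

With these settings ALL needs all 6 replicas, so a single replica that the coordinator considers down (or that does not answer in time) is enough to fail the query, whereas QUORUM needs 4 and LOCAL_QUORUM needs 2.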

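The reason I'm running these queries in cqlsh is that one of our applications behaves erratically when querying Cassandra (e.g. not enough replicas for QUORUM), and I suspect we may have a communication problem between nodes. Either gossip is telling different things to different nodes, or something along those lines. Communication from every node to every other node works (we can use cqlsh, ssh and everything).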

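Could my theory be right, and do we have some kind of inconsistency in the configuration? If so, how can I debug these failures? Is there a way to know which node is not alive or not responding, so I can take a closer look at its communication? I tried "tracing on", but it only works for successful queries, so I only get a trace on node2B (by the way, the behavior is not always the same on a given node; it seems random).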

If not, is my cqlsh test valid? Or am I missing some important piece of the Cassandra puzzle here?

Thanks a lot, I'm going crazy here....

Edit: as requested, here is the output of nodetool describecluster. I ran it on all 3 nodes of DC2:

node2A:

Cluster Information:
    Name: Cassandra Cluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        19ada8a5-4688-3fa8-9479-e612388f67ee: [node2A, node2B, node1A, node1B, node1C, other IPs from other nodes (from other datacenters and keyspaces)]

node2B:

Cluster Information:
    Name: Cassandra Cluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        19ada8a5-4688-3fa8-9479-e612388f67ee: [node2A, node2B, node2C, node1A, node1B, node1C, other IPs from other nodes (from other datacenters and keyspaces)]
        UNREACHABLE: [couple of IPs from other datacenter/keyspaces]

node2C:

Cluster Information:
    Name: Cassandra Cluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        19ada8a5-4688-3fa8-9479-e612388f67ee: [node2B, node2C, node1A, node1B, node1C, other IPs from other nodes (from other datacenters and keyspaces)]
        UNREACHABLE: [node2A and other IPs]

It's worth noting that node2C is missing from node2A's output, all 3 nodes appear in node2B's output, and node2C lists node2A as UNREACHABLE...
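
The comparison described above can also be done mechanically once each node's view has been collected. The following is only a sketch: the views dictionary is typed in by hand from the describecluster output in this post (restricted to the DC2 nodes), and find_disagreements is a hypothetical helper, not part of nodetool or any driver.

```python
# Each observer's view of the DC2 nodes, copied by hand from the
# describecluster output above. "schema" holds the nodes listed under the
# schema version, "unreachable" the nodes listed as UNREACHABLE.
views = {
    "node2A": {"schema": {"node2A", "node2B"}, "unreachable": set()},
    "node2B": {"schema": {"node2A", "node2B", "node2C"}, "unreachable": set()},
    "node2C": {"schema": {"node2B", "node2C"}, "unreachable": {"node2A"}},
}


def find_disagreements(views):
    """List (observer, target, problem) for every node an observer cannot see."""
    nodes = set(views)
    problems = []
    for observer, view in sorted(views.items()):
        for target in sorted(nodes):
            if target in view["unreachable"]:
                problems.append((observer, target, "unreachable"))
            elif target not in view["schema"]:
                problems.append((observer, target, "missing"))
    return problems


for observer, target, problem in find_disagreements(views):
    print(f"{observer} reports {target} as {problem}")
```

On the data from this post it flags exactly two pairs: node2A does not list node2C at all, and node2C lists node2A as unreachable, while node2B sees everyone.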

I feel this is very wrong, somehow...

I just ran "nodetool status keyspaceX" and the results are as follows:

node2A:

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load      Tokens  Owns (effective)  Host ID  Rack
UN  node2A   67,78 MB  256     100,0%            -        RAC1
UN  node2B   67,18 MB  256     100,0%            -        RAC1
?N  node2C   67,11 MB  256     100,0%            -        RAC1

node2B:

node2C:

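Now, how come node2A doesn't know the status of node2C (it shows as ? and doesn't appear under Schema versions in describecluster)? And why does node2C, which reports node2A as UNREACHABLE in describecluster, see node2A as up according to nodetool status?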

First, to check whether any node is unreachable, you can run nodetool describecluster and analyze the output.

Communication between nodes happens over port 7000, through gossip and message exchange, not over ssh or cqlsh.
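
As a quick first check of that port, a plain TCP connection attempt from each node to its peers can be scripted. This is a minimal sketch under a few assumptions: 7000 is the default storage_port (a cluster with internode encryption may use a different port), can_connect is a hypothetical helper, and a successful connect only shows that the port accepts connections, not that gossip itself is healthy.

```python
import socket


def can_connect(host, port=7000, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Running something like can_connect("node2C") from node2A, and the reverse from node2C, would show whether the storage port is reachable in both directions between the two nodes that disagree about each other.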

Regarding the 3 results above:

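When you ran the query, one of the nodes may have been unreachable at that moment, so consistency level ALL could not be achieved.
This time all nodes were alive and the consistency level was met, so you got the data.
In this case, the coordinator node did not get data from all nodes within the timeout period and threw a timeout exception. The timeout can be set in cassandra.yaml.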

Hope this answers your question.

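This turned out to be related to an internal Cassandra issue. Because of some corrupted hint files, the gossip process was shutting down while the rest of the Cassandra process stayed up and running. As a result, that node could see everyone else, but the rest reported it as down because its Gossiper was down (in fact, port 9160 was closed after the exception).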

Exception screenshot

The actual Cassandra issue is https://issues.apache.org/jira/browse/CASSANDRA-12728

Hope it helps.

consistency

cassandra