Elasticsearch 7 on OpenStack instances: unable to set up an ES cluster

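I'm trying to set up an Elasticsearch cluster on OpenStack. I have two OpenStack instances, each running ES, and the instances can ping each other and curl each other's ES instance. However, no matter how I configure the elasticsearch.yml file, I can't get them to form a cluster.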

I'm using Elasticsearch 7.3.2 on both instances. The configurations below use the non-floating IPs.

Instance 1 - elasticsearch.yml

cluster.name: my-cluster
node.name: node-1
node.master: true
node.data: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: [_local_,_site_]
http.port: 9200
discovery.seed_hosts: ["<INSTANCE1-IP>:9300", "<INSTANCE2-IP>:9300"]
cluster.initial_master_nodes: ["<INSTANCE2-IP>:9300"]

Instance 2 - elasticsearch.yml

cluster.name: my-cluster
node.name: node-2
node.master: false
node.data: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: [_local_,_site_]
http.port: 9200
discovery.seed_hosts: ["<INSTANCE1-IP>:9300", "<INSTANCE2-IP>:9300"]
cluster.initial_master_nodes: ["<INSTANCE2-IP>:9300"]

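With these configurations, the master node (Instance 1) starts fine, but when I check the health of the second node (Instance 2) I get a master_not_discovered_exception (503). Any ideas?

The logs on node 2 show the following: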

[2019-10-01T09:45:53,126][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node-2] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2019-10-01T09:45:53,127][WARN ][r.suppressed             ] [node-2] path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.onTimeout(TransportMasterNodeAction.java:251) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:572) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-10-01T09:45:59,754][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node-2] master not discovered yet: have discovered [{node-2}{Sb2fOZEKR42_4sB2XnmShg}{LgJ0DLojSay7KV2_cXgdpw}{<INSTANCE2-IP>}{<INSTANCE2-IP>:9300}{di}{ml.machine_memory=33728778240, xpack.installed=true, ml.max_open_jobs=20}, {node-1}{HqONwd3fQHSWZxxEtctcog}{cu5KW146S8-04oBBCqL3QA}{<INSTANCE1-IP>}{<INSTANCE1-IP>:9300}{dim}{ml.machine_memory=33728778240, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [<INSTANCE2-IP>:9300] from hosts providers and [] from last-known cluster state; node term 0, last-accepted version 0 in term 0

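I managed to resolve this thanks to this thread. The fix was a combination of changing some configuration and doing a full ES restart after deleting the node data.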

The steps I took:

1) Check the logs to identify the error:

sudo -i
cd /var/log/elasticsearch
cat my-cluster.log
[2019-10-01T09:45:53,126][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [node-2] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2019-10-01T09:45:53,127][WARN ][r.suppressed             ] [node-2] path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.onTimeout(TransportMasterNodeAction.java:251) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:572) [elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.2.jar:7.3.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:835) [?:?]
[2019-10-01T09:45:59,754][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node-2] master not discovered yet: have discovered [{node-2}{Sb2fOZEKR42_4sB2XnmShg}{LgJ0DLojSay7KV2_cXgdpw}{<INSTANCE2-IP>}{<INSTANCE2-IP>:9300}{di}{ml.machine_memory=33728778240, xpack.installed=true, ml.max_open_jobs=20}, {node-1}{HqONwd3fQHSWZxxEtctcog}{cu5KW146S8-04oBBCqL3QA}{<INSTANCE1-IP>}{<INSTANCE1-IP>:9300}{dim}{ml.machine_memory=33728778240, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [<INSTANCE2-IP>:9300] from hosts providers and [] from last-known cluster state; node term 0, last-accepted version 0 in term 0

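2) Change cluster.initial_master_nodes to use the node name rather than the node IP. The original value also pointed at instance 2, which has node.master: false and so is not master-eligible; the new value names node-1: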

cluster.initial_master_nodes: ["node-1"]

3) Delete the existing node data and restart Elasticsearch on both nodes:

sudo rm -rf /var/lib/elasticsearch
sudo service elasticsearch restart
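
For reference, here is a sketch of what the two elasticsearch.yml files might look like once step 2 is applied. Everything except cluster.initial_master_nodes is copied from the question. It assumes the same value is set on both nodes, as in the original files, and the two files are shown as separate YAML documents split by `---` only for readability.

```yaml
# Instance 1: elasticsearch.yml (master-eligible data node)
cluster.name: my-cluster
node.name: node-1
node.master: true
node.data: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: [_local_, _site_]
http.port: 9200
discovery.seed_hosts: ["<INSTANCE1-IP>:9300", "<INSTANCE2-IP>:9300"]
# Refers to the node.name of the master-eligible node (node-1),
# not to an IP address and not to the data-only node.
cluster.initial_master_nodes: ["node-1"]
---
# Instance 2: elasticsearch.yml (data-only node, not master-eligible)
cluster.name: my-cluster
node.name: node-2
node.master: false
node.data: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: [_local_, _site_]
http.port: 9200
discovery.seed_hosts: ["<INSTANCE1-IP>:9300", "<INSTANCE2-IP>:9300"]
# Assumption: kept identical to instance 1, mirroring the original files.
cluster.initial_master_nodes: ["node-1"]
```

The entry in cluster.initial_master_nodes has to match the node.name of a node with node.master: true, which is why it names node-1 here.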