Search thread_pool for particular nodes always at maximum

Question

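I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB. (I know that less than 32 GB is recommended, but it was already set to 50 GB for reasons I don't know.) Now I am seeing a lot of rejections from the search thread_pool.

This is my current search thread_pool: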

node_name               name   active rejected  completed
1105-IDC.node          search      0 19295154 1741362188
1108-IDC.node          search      0  3362344 1660241184
1103-IDC.node          search     49 28763055 1695435484
1102-IDC.node          search      0  7715608 1734602881
1106-IDC.node          search      0 14484381 1840694326
1107-IDC.node          search     49 22470219 1641504395

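I noticed that two nodes (1103-IDC.node and 1107-IDC.node) always have the maximum number of active threads. The other nodes have rejections too, but these two have the highest counts. Their hardware is similar to that of the other nodes. What could be the reason? Could it be that they hold particular shards that receive more hits? If so, how do I find those shards?

Also, on the nodes where the active threads are always at maximum, young-generation GC pauses take more than 70 ms (sometimes around 200 ms). These are some lines from the GC log: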

[2020-10-27T04:32:14.380+0000][53678][gc             ] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
[2020-10-27T04:32:26.206+0000][53678][gc,start       ] GC(6768758) Pause Young (Allocation Failure)
[2020-10-27T04:32:26.313+0000][53678][gc             ] GC(6768758) Pause Young (Allocation Failure) 27897M->26444M(51008M) 107.850ms
[2020-10-27T04:32:35.466+0000][53678][gc,start       ] GC(6768759) Pause Young (Allocation Failure)
[2020-10-27T04:32:35.574+0000][53678][gc             ] GC(6768759) Pause Young (Allocation Failure) 27975M->26444M(51008M) 108.923ms
[2020-10-27T04:32:40.993+0000][53678][gc,start       ] GC(6768760) Pause Young (Allocation Failure)
[2020-10-27T04:32:41.077+0000][53678][gc             ] GC(6768760) Pause Young (Allocation Failure) 27975M->26427M(51008M) 84.411ms
[2020-10-27T04:32:45.132+0000][53678][gc,start       ] GC(6768761) Pause Young (Allocation Failure)
[2020-10-27T04:32:45.200+0000][53678][gc             ] GC(6768761) Pause Young (Allocation Failure) 27958M->26471M(51008M) 68.105ms
[2020-10-27T04:32:46.984+0000][53678][gc,start       ] GC(6768762) Pause Young (Allocation Failure)
[2020-10-27T04:32:47.046+0000][53678][gc             ] GC(6768762) Pause Young (Allocation Failure) 28001M->26497M(51008M) 62.678ms
[2020-10-27T04:32:56.641+0000][53678][gc,start       ] GC(6768763) Pause Young (Allocation Failure)
[2020-10-27T04:32:56.719+0000][53678][gc             ] GC(6768763) Pause Young (Allocation Failure) 28027M->26484M(51008M) 77.596ms
[2020-10-27T04:33:29.488+0000][53678][gc,start       ] GC(6768764) Pause Young (Allocation Failure)
[2020-10-27T04:33:29.740+0000][53678][gc             ] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms
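
To put numbers on pauses like these, log lines of this shape can be summarized with a short script. The following is a minimal sketch, assuming the unified GC log format shown above; the function name `summarize_young_pauses` and the regular expression are illustrative, not part of any Elasticsearch or JVM tooling.

```python
import re

# Matches the summary line of a young-generation pause, e.g.
# "... Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms".
# The "gc,start" lines carry no heap sizes or duration, so they do not match.
PAUSE_RE = re.compile(
    r"Pause Young \([^)]*\) (\d+)M->(\d+)M\((\d+)M\) ([\d.]+)ms"
)


def summarize_young_pauses(lines):
    """Return the count, mean and max pause (in ms) over matching lines."""
    pauses = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses.append(float(match.group(4)))
    if not pauses:
        return {"count": 0, "mean_ms": 0.0, "max_ms": 0.0}
    return {
        "count": len(pauses),
        "mean_ms": sum(pauses) / len(pauses),
        "max_ms": max(pauses),
    }
```

Running it over the same time window on a busy node and on a quiet node gives a like-for-like comparison of pause times.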

Answer 1

One important thing to note: if you got these stats from the Elasticsearch cat thread pool API, it only shows point-in-time data. It does not show historical data for the past 1 hour, 6 hours, 1 day, 1 week, and so on.

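Also, rejected and completed are totals counted since the node's last restart, so they are not very helpful either when we are trying to figure out whether some ES nodes are becoming hot spots because of the shard configuration.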
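
Because these counters are cumulative, one way to get something time-ranged out of them is to take two snapshots and subtract. The following is a minimal sketch, assuming the text output of `_cat/thread_pool/search?v&h=node_name,name,active,rejected,completed` has been saved at two points in time; the function names are hypothetical.

```python
def parse_cat_thread_pool(text):
    """Parse cat thread pool text output (with header row) into a dict
    keyed by node name."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    stats = {}
    for row in data:
        rec = dict(zip(header, row))
        stats[rec["node_name"]] = {
            "active": int(rec["active"]),
            "rejected": int(rec["rejected"]),
            "completed": int(rec["completed"]),
        }
    return stats


def rejection_deltas(earlier, later):
    """Return the number of rejections per node between two snapshots."""
    deltas = {}
    for node, now in later.items():
        before = earlier.get(node)
        if before is None or now["rejected"] < before["rejected"]:
            # Node is new, or its counters were reset by a restart: skip it.
            continue
        deltas[node] = now["rejected"] - before["rejected"]
    return deltas
```

Taking snapshots at a fixed interval during peak hours and comparing the deltas shows which nodes are rejecting requests now, rather than which ones rejected the most since they were last restarted.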

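So there are two important things to figure out here:

Make sure we know the actual hot-spot nodes in the cluster, by looking at the average active and rejected requests on the data nodes over a time range (you can check just the peak hours).
Once the hot-spot nodes are known, look at the shards allocated to them and compare them with the shards on the other nodes. A few metrics to check are the number of shards, which shards receive more traffic, and which shards receive the slowest queries. Most of this you have to work out by looking at various metrics and ES APIs, which can be very time-consuming and requires a lot of knowledge of ES internals.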
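
The shard comparison in the second step can start from the output of `_cat/shards?format=json`, which returns one row per shard copy. The following is a minimal sketch, assuming those rows have already been fetched and decoded into a list of dicts; the function name `shards_by_node` is hypothetical.

```python
from collections import Counter, defaultdict


def shards_by_node(rows):
    """Count started shard copies per node, and per index on each node.

    `rows` is assumed to be the decoded JSON from `_cat/shards?format=json`,
    where each row has at least "index", "state" and "node" keys.
    """
    per_node = Counter()
    per_node_index = defaultdict(Counter)
    for row in rows:
        node = row.get("node")
        if row.get("state") != "STARTED" or not node:
            continue  # skip copies that are not serving searches on a node
        per_node[node] += 1
        per_node_index[node][row["index"]] += 1
    return per_node, per_node_index
```

A node that holds noticeably more shards of a busy index than its peers is a first candidate. Shard counts alone do not show traffic, so the per-shard search counters from the index stats API (`level=shards`) would still need to be compared as well.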

java

garbage-collection

memory-leaks

elasticsearch