Apache Storm deactivated topologies cause high CPU utilization

Question

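I am seeing high CPU usage from deactivated Apache Storm topologies. I can reliably reproduce the problem with the steps below, but I have not yet identified the exact cause or a fix.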

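The environment is a Storm cluster running one topology (an extremely simple one; I am using the exclamation example). The topology is INACTIVE (deactivated). Initially, CPU usage is normal. However, when I kill all topology JVM processes on all supervisors and let Storm restart them, the CPU usage of each JVM process climbs to nearly 100% after some time (~9 hours). I tested an ACTIVE topology and this does not happen. I also tested with more than one topology and observed the same result whenever they were INACTIVE.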

Steps to reproduce:

Run 1 topology on an Apache Storm cluster
Deactivate it
Kill all topology JVM processes on all supervisors (Storm will restart them)
Observe CPU usage on the supervisors climb to nearly 100% for all INACTIVE topology JVM processes

Environment

Apache Storm 1.1.0 running on 3 VMs: 1 nimbus and 2 supervisors.

Cluster summary:

Supervisors: 2
Used slots: 2
Available slots: 38
Total slots: 40
Executors: 50
Tasks: 50

The topology has 2 workers and 50 executors/tasks (threads).

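Investigation so far:

Besides being able to reliably reproduce the problem, I have identified the threads that use the most CPU in an affected topology JVM process. The process has 102 threads in total: 97 BLOCKED and 5 IN_NATIVE. The threads using the most CPU all have the same stack trace; there are 23 of them, all in the BLOCKED state: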

Thread 28558: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
 - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
 - com.lmax.disruptor.RingBuffer.next(int) @bci=5, line=260 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue.publishDirect(java.util.ArrayList, boolean) @bci=18, line=517 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue.access$1000(org.apache.storm.utils.DisruptorQueue, java.util.ArrayList, boolean) @bci=3, line=61 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue$ThreadLocalBatcher.flush(boolean) @bci=50, line=280 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue$Flusher.run() @bci=55, line=303 (Interpreted frame)
 - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
 - java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1142 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=617 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)

I identified this thread by taking a thread dump of the process with jstack:

jstack -F <pid> > jstack-<pid>.txt

and by using top to find the threads in the process that use the most CPU:

top -H -p <pid>
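
For comparison, a similar per-thread CPU ranking can be obtained from inside a JVM through the standard ThreadMXBean API. The following is only a minimal sketch: the class name ThreadCpuTop and the method topThreads are made up for illustration, it inspects the JVM it runs in (a Storm worker would need to be reached some other way, for example over JMX), and it reports Java thread IDs rather than the OS thread IDs that top -H shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Storm: ranks the threads of the
// current JVM by accumulated CPU time.
public class ThreadCpuTop {

    /** Returns one "cpuMillis state name" line for each of the n busiest threads. */
    public static List<String> topThreads(int n) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported()) {
            return out; // this JVM cannot measure per-thread CPU time
        }
        List<long[]> rows = new ArrayList<>(); // each row is {javaThreadId, cpuNanos}
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread died or timing is disabled
            if (cpuNanos >= 0) {
                rows.add(new long[] {id, cpuNanos});
            }
        }
        // Busiest threads first.
        rows.sort(Comparator.comparingLong((long[] r) -> r[1]).reversed());
        for (long[] r : rows.subList(0, Math.min(n, rows.size()))) {
            ThreadInfo info = mx.getThreadInfo(r[0]);
            if (info != null) { // the thread may have exited in the meantime
                out.add((r[1] / 1_000_000L) + "ms " + info.getThreadState()
                        + " " + info.getThreadName());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        topThreads(5).forEach(System.out::println);
    }
}
```

The values are cumulative CPU time since each thread started, so two samples taken some seconds apart would be needed to see which threads are busy right now.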

Has anyone run into this or a similar problem? Any help would be much appreciated.

Answer 1

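The problem occurs because the RingBuffer in the DisruptorQueue is full: when the publishing threads try to claim a slot, they effectively get stuck looping on LockSupport.parkNanos(1L). This is as per my comment on the Storm JIRA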
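
To illustrate why that pattern burns CPU, here is a minimal sketch of a producer waiting for a free slot in a full bounded buffer by retrying after LockSupport.parkNanos(1L). It is not Storm's or the Disruptor's actual code; the class FullBufferSpin, the method spinWhileFull and its parameters are hypothetical, and the sketch assumes nothing drains the buffer while the producer waits.

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical illustration, not Storm or Disruptor code.
public class FullBufferSpin {

    /**
     * Simulates a producer that wants one more slot in a ring of bufferSize
     * slots, where `claimed` slots have been handed out and `consumed` have
     * been released by the consumer. While the ring is full, the producer
     * parks for 1 ns and retries. Returns the number of retries performed
     * before maxMillis elapsed.
     */
    public static long spinWhileFull(int bufferSize, long claimed, long consumed, long maxMillis) {
        long start = System.nanoTime();
        long maxNanos = maxMillis * 1_000_000L;
        long retries = 0;
        // The consumer position never changes here (assumed stalled), so the
        // loop only ends when the time budget runs out.
        while (claimed - consumed >= bufferSize
                && System.nanoTime() - start < maxNanos) {
            LockSupport.parkNanos(1L); // returns almost immediately, so this is close to a busy loop
            retries++;
        }
        return retries;
    }

    public static void main(String[] args) {
        long retries = spinWhileFull(8, 8, 0, 200);
        System.out.println("retries while the buffer stayed full: " + retries);
    }
}
```

Because a 1 ns park returns almost at once, each waiting thread wakes up and rechecks many times per second for as long as the buffer stays full. With 23 threads in that state, as in the thread dump in the question, this would be consistent with the near-100% CPU usage observed.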
