Embedded neo4j crash with no stacktrace

I am running neo4j 2.3.0-RC1 embedded, using the Java API. It keeps crashing with no warning, and I am trying to figure out why.

I previously ran this code against 1.9.8 with no problems. Upgrading to 2.0+ required adding transactions, changing some Cypher syntax, Spring configuration at startup, and a small number of other changes.
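For reference, the transaction change looks roughly like this (a minimal sketch of the 2.x embedded transaction pattern, not my actual code; db stands in for the engine's GraphDatabaseService):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class TxExample {
    // db is whatever embedded GraphDatabaseService the engine already holds.
    static void addNode(GraphDatabaseService db) {
        // From 2.0 on, all graph operations must run inside a transaction.
        try (Transaction tx = db.beginTx()) {
            Node node = db.createNode();
            node.setProperty("name", "example");
            tx.success(); // mark as successful so it commits when the tx closes
        }
    }
}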

The vast majority of the code is unchanged and is functionally correct, as confirmed by unit and integration tests.

When the engine is started, it adds new nodes at a fairly steady rate. The output below shows the mysterious crash after 290 minutes.

This seems to happen consistently, sometimes after 2 hours and sometimes after 5. It never happened at all with 1.9.8.

The JVM is launched with ./start-engine.sh > console.out 2>&1 &

The operative line of start-engine.sh is

$JAVA_HOME/bin/java -server $JAVA_OPTIONS $JPROFILER_OPTIONS -cp '.:lib/*' package.engine.Main $*

Below are the last few lines of console.out.

17437.902: RevokeBias                       [     112          6              5    ]      [    20     6    27    43    26    ]  1
17438.020: RevokeBias                       [     112          3              9    ]      [     5     0     5     0     0    ]  3
17438.338: GenCollectForAllocation          [     113          2              2    ]      [     1     0    11     4    32    ]  2
17438.857: BulkRevokeBias                   [     112          3             13    ]      [     0     0    28     6     2    ]  3
./start-engine.sh: line 17: 19647 Killed                  $JAVA_HOME/bin/java -server $JAVA_OPTIONS $JPROFILER_OPTIONS -cp '.:lib/*' package.engine.Main $*

There is no stack trace and no other error output.

These are the last few lines of messages.log from /mnt/engine-data:
2015-10-30 18:07:44.457+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]:  Starting check pointing...
2015-10-30 18:07:44.458+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]:  Starting store flush...
2015-10-30 18:07:44.564+0000 INFO  [o.n.k.i.s.c.CountsTracker] About to rotate counts store at transaction 845664650 to [/mnt/engine-data/neostore.counts.db.b], from [/mnt/engine-data/neostore.counts.db.a].
2015-10-30 18:07:44.565+0000 INFO  [o.n.k.i.s.c.CountsTracker] Successfully rotated counts store at transaction 845664650 to [/mnt/engine-data/neostore.counts.db.b], from [/mnt/engine-data/neostore.counts.db.a].
2015-10-30 18:07:44.834+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]:  Store flush completed
2015-10-30 18:07:44.835+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]:  Starting appending check point entry into the tx log...
2015-10-30 18:07:44.836+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]:  Appending check point entry into the tx log completed
2015-10-30 18:07:44.836+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] Check Pointing triggered by scheduler for time threshold [845664646]:  Check pointing completed
2015-10-30 18:07:44.836+0000 INFO  [o.n.k.i.t.l.p.LogPruningImpl] Log Rotation [35826]:  Starting log pruning.
2015-10-30 18:07:44.844+0000 INFO  [o.n.k.i.t.l.p.LogPruningImpl] Log Rotation [35826]:  Log pruning complete.

So everything looks fine right up to the crash, and the crash comes completely out of the blue.

There is a lot of other data in messages.log, but I don't know what I should be looking for.


$ java -version
java version "1.7.0_65"
Java(TM) SE Runtime Environment (build 1.7.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

$ uname -a
Linux 3.13.0-65-generic #106-Ubuntu SMP Fri Oct 2 22:08:27 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

You are probably seeing the effects of the Linux out-of-memory (OOM) killer, which terminates processes when the system runs critically low on physical memory. That would explain why you find nothing in your logs.
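If that is what happened, the kernel log should say so. A quick check (standard commands; the log path varies by distribution):

dmesg | grep -iE "killed process|out of memory"
grep -i "killed process" /var/log/syslog      # /var/log/kern.log on some systems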

Quoting this excellent article:

Because many applications allocate their memory up front and often don't utilize the memory allocated, the kernel was designed with the ability to over-commit memory to make memory usage more efficient. ……… When too many applications start utilizing the memory they were allocated, the over-commit model sometimes becomes problematic and the kernel must start killing processes …

The article quoted above is a great resource for understanding the OOM Killer, and it contains plenty of information on how to troubleshoot and how to configure Linux to try to avoid the problem.
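As an illustration of the kind of configuration the article covers, the kernel's over-commit behaviour can be inspected and tightened with sysctl (the values here are examples, not a recommendation for your box):

sysctl vm.overcommit_memory     # 0 = heuristic (default), 1 = always over-commit, 2 = never
sysctl vm.overcommit_ratio      # % of RAM committable when overcommit_memory=2

# Disallow over-commit entirely -- test carefully before relying on it
sudo sysctl -w vm.overcommit_memory=2
sudo sysctl -w vm.overcommit_ratio=80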

And quoting an answer to this question:

The OOM Killer has to select the best process to kill. Best here refers to that process which will free up maximum memory upon killing and is also least important to the system.

Since the neo4j process is most likely the most memory-intensive process on your system, it makes sense that it is the one that gets killed when physical memory starts running low.

One way to avoid the OOM Killer is to keep other memory-intensive processes off the same system as much as possible. That should greatly reduce the likelihood of memory over-commitment. But you should at least read the first article above to understand the OOM Killer better -- there is a lot to know.
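Two other knobs worth knowing about, offered as suggestions to adapt rather than copy: cap the JVM heap explicitly in JAVA_OPTIONS so its footprint stays predictable, and lower the engine process's priority in the OOM Killer's eyes. (Note that the heap cap does not bound the whole process, since the JVM and neo4j also use memory outside the Java heap.)

# In start-engine.sh: fix the heap size (4g is illustrative)
JAVA_OPTIONS="-Xms4g -Xmx4g"

# Make the OOM Killer less inclined to pick the engine process
# (range is -1000..1000; -1000 exempts it entirely; requires root)
echo -500 | sudo tee /proc/$(pgrep -f package.engine.Main)/oom_score_adj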