H2O server crash
I have been using H2O for the past year, and I am getting very tired of the server crashes. I have given up on the "nightly releases", because they are easily crashed by my data sets. Please tell me where I can download a stable version.
Charles
My environment is:
- Windows 10 Enterprise, build 1607, 64 GB memory.
- Java SE Development Kit 8 Update 77 (64-bit).
- Anaconda Python 3.6.2-0.
I started the server with:
import h2o

localH2O = h2o.init(ip = "localhost",
                    port = 54321,
                    max_mem_size = "12G",
                    nthreads = 4)
The h2o init info was:
H2O cluster uptime: 12 hours 12 mins
H2O cluster version: 3.10.5.2
H2O cluster version age: 1 month and 6 days
H2O cluster name: H2O_from_python_Charles_ji1ndk
H2O cluster total nodes: 1
H2O cluster free memory: 6.994 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 4
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.2 final
The crash info was:
OSError: Job with key 017f00000132d4ffffffff$_a0ce9b2c855ea5cff1aa58d65c2a4e7c failed with an exception: java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
stacktrace:
java.lang.AssertionError: I am really confused about the heap usage; MEM_MAX=11453595648 heapUsedGC=11482667352
at water.MemoryManager.set_goals(MemoryManager.java:97)
at water.MemoryManager.malloc(MemoryManager.java:265)
at water.MemoryManager.malloc(MemoryManager.java:222)
at water.MemoryManager.arrayCopyOfRange(MemoryManager.java:291)
at water.AutoBuffer.expandByteBuffer(AutoBuffer.java:719)
at water.AutoBuffer.putA4f(AutoBuffer.java:1355)
at hex.deeplearning.Storage$DenseRowMatrix$Icer.write129(Storage$DenseRowMatrix$Icer.java)
at hex.deeplearning.Storage$DenseRowMatrix$Icer.write(Storage$DenseRowMatrix$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:771)
at water.AutoBuffer.putA(AutoBuffer.java:883)
at hex.deeplearning.DeepLearningModelInfo$Icer.write128(DeepLearningModelInfo$Icer.java)
at hex.deeplearning.DeepLearningModelInfo$Icer.write(DeepLearningModelInfo$Icer.java)
at water.Iced.write(Iced.java:61)
at water.AutoBuffer.put(AutoBuffer.java:771)
at hex.deeplearning.DeepLearningModel$Icer.write105(DeepLearningModel$Icer.java)
at hex.deeplearning.DeepLearningModel$Icer.write(DeepLearningModel$Icer.java)
at water.Iced.write(Iced.java:61)
at water.Iced.asBytes(Iced.java:42)
at water.Value.<init>(Value.java:348)
at water.TAtomic.atomic(TAtomic.java:22)
at water.Atomic.compute2(Atomic.java:56)
at water.Atomic.fork(Atomic.java:39)
at water.Atomic.invoke(Atomic.java:31)
at water.Lockable.unlock(Lockable.java:181)
at water.Lockable.unlock(Lockable.java:176)
at hex.deeplearning.DeepLearning$DeepLearningDriver.trainModel(DeepLearning.java:491)
at hex.deeplearning.DeepLearning$DeepLearningDriver.buildModel(DeepLearning.java:311)
at hex.deeplearning.DeepLearning$DeepLearningDriver.computeImpl(DeepLearning.java:216)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:173)
at hex.deeplearning.DeepLearning$DeepLearningDriver.compute2(DeepLearning.java:209)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
You're going to need a bigger boat.
The error message shows "heapUsedGC=11482667352", which is higher than MEM_MAX. Why not give it more of your 64GB, rather than max_mem_size="12G"? Or build a less ambitious model (fewer hidden nodes, less training data, that kind of thing).
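For example, something along these lines (just a sketch; 48G is an illustrative number, and you would want to leave headroom for the OS and anything else running on the 64GB box):

import h2o

# Restart the local cluster with a bigger heap (illustrative value, not an official recommendation)
localH2O = h2o.init(ip = "localhost",
                    port = 54321,
                    max_mem_size = "48G",
                    nthreads = 4)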
(Obviously, ideally, h2o shouldn't crash, and should instead abort gracefully as it approaches using all the available memory. If you are able to share your data/code with H2O, it might be worth opening a bug report on their JIRA.)
By the way, I've been running h2o 3.10.x.x as the back-end to a web server process for about 9 months, restarting it automatically on weekends, and have not had a single crash. Well, I did once - after I left it running for 3 weeks it filled up all the memory with more and more data and models. That is why I switched it to a weekly restart, and only keep the models I need in memory. (This is on an AWS instance with 4GB of memory, by the way; it is restarted via a cron job and a bash command.)
You can always download the latest stable release from https://www.h2o.ai/download (there's a link labeled "latest stable release"). The latest stable Python package can be downloaded via PyPI and Anaconda; the latest stable R package is available on CRAN.
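For example, after installing the stable package, you can check from Python which version the client is on (it should match the stable release listed on the download page rather than a nightly build):

import h2o

print(h2o.__version__)   # e.g. a stable 3.x release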
I agree with Darren that you probably need more memory -- H2O should not crash if there is enough memory in your H2O cluster. We generally say that, in order to train a model, you should have a cluster that is at least 3-4x the size of your training set on disk. However, if you are building a grid of models, or many models, you will need to increase the memory so that you have enough RAM to store all of those models as well.
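As a rough back-of-the-envelope check (a sketch only, not an official H2O sizing tool; train_path is a hypothetical path to your training file):

import os

train_path = "train.csv"                      # hypothetical: your training set on disk
size_gb = os.path.getsize(train_path) / 1e9   # file size in GB
print("Rule of thumb: cluster memory of at least %.1f-%.1fG" % (3 * size_gb, 4 * size_gb))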