Databricks notebook crashes on memory-heavy job
I run a few operations on Azure Databricks that aggregate a large amount of data (about 600 GB). I recently noticed that the notebook crashes and Databricks returns the error below. The same code used to work on a smaller 6-node cluster; after upgrading it to 12 nodes I started getting this message, and I suspect it is a configuration issue.
Any help would be appreciated. I am using the default Spark configuration with the number of partitions = 200, and I have 88 executors across my nodes.
Thanks
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll(DriverClient.scala:381)
at scala.concurrent.Future.$anonfun$flatMap(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith(Promise.scala:41)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at com.databricks.threading.NamedExecutor$$anon.$anonfun$run(NamedExecutor.scala:335)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
at com.databricks.threading.NamedExecutor$$anon.run(NamedExecutor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm not sure about the cost impact, but how about enabling the autoscaling option on the cluster and raising Max Workers? You could also try changing the Worker Type to get better-resourced nodes.
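The suggestion above can also be expressed as a cluster spec. A minimal sketch, assuming the JSON shape of the Databricks Clusters API: the `autoscale` block replaces a fixed `num_workers`, and the `cluster_name`, `spark_version`, and `node_type_id` values below are placeholders for illustration, not settings from the question:

```json
{
  "cluster_name": "aggregation-cluster",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "Standard_DS4_v2",
  "autoscale": {
    "min_workers": 6,
    "max_workers": 12
  }
}
```

With `autoscale` set, Databricks should add workers up to `max_workers` only when the load requires them, which limits the cost impact compared with a fixed 12-node cluster. The same settings are available in the cluster UI under the autoscaling checkbox and the Min/Max Workers fields.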
Just for anyone else facing a similar issue.
In my case, the same error sometimes occurred when one cell of a Databricks notebook contained multiple Spark actions.
Surprisingly, splitting the cell right before the code where the error occurred, or simply inserting `time.sleep(5)` there, worked for me. I'm not sure why it works, though...
For example:
df1.count() # some Spark action
# split the cell or insert `time.sleep(5)` here
pipeline.fit(df1) # another Spark action where the error happened