org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times

I am using Google Cloud Dataproc to run Spark jobs, and my editor is Zeppelin. I am trying to write JSON data to a GCS bucket. It worked previously when I tried a 10MB file, but it fails with a 10GB file. My Dataproc cluster has 1 master with 4 CPUs, 26GB of memory, and a 500GB disk, plus 5 workers with the same configuration. I would expect it to be able to handle 10GB of data.

My command is toDatabase.repartition(10).write.json("gs://mypath")

The error is

org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand.apply(DataFrameWriter.scala:656)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand.apply(DataFrameWriter.scala:656)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
  at org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:528)
  ... 54 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 in stage 11.0 failed 4 times, most recent failure: Lost task 98.3 in stage 11.0 (TID 3895, etl-w-2.us-east1-b.c.team-etl-234919.internal, executor 294): ExecutorLostFailure (executor 294 exited caused by one of the running tasks) Reason: Container marked as failed: container_1554684028327_0001_01_000307 on host: etl-w-2.us-east1-b.c.team-etl-234919.internal. Exit status: 143. Diagnostics: [2019-04-08 01:50:14.153]Container killed on request. Exit code is 143
[2019-04-08 01:50:14.153]Container exited with a non-zero exit code 143.
[2019-04-08 01:50:14.154]Killed by external signal

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1639)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage.apply(DAGScheduler.scala:1638)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
  at org.apache.spark.util.EventLoop$$anon.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
  ... 74 more

Any idea why?

If it runs on a smaller dataset but not on a larger one, you are most likely hitting out-of-memory limits on the Spark workers. Per-worker memory problems depend far more on your partitioning and per-executor settings than on the total memory available across the whole cluster (so creating a bigger cluster will not help with this kind of problem). For example, writing 10GB out as only 10 partitions means each write task has to handle roughly 1GB of data, which can easily exceed the memory a single executor task has to work with.
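Before changing anything, it can help to check how the data is actually partitioned and whether some partitions are much larger than others. A minimal diagnostic sketch for the Zeppelin notebook (assuming the toDatabase DataFrame from the question) might look like:

  // How many partitions does the DataFrame have before the write?
  val numPartitions = toDatabase.rdd.getNumPartitions
  println(s"partitions before write: $numPartitions")

  // Optional: record count per partition, to spot skew.
  // Note that this scans the full dataset.
  toDatabase.rdd
    .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
    .collect()
    .foreach { case (idx, n) => println(s"partition $idx: $n records") }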

You can try any combination of the following:

  1. Repartition into more output partitions than 10 (see the sketch after this list)
  2. Create the cluster with highmem machine types instead of standard ones
  3. Create the cluster with Spark memory settings that change the ratio of memory to CPUs: for example, gcloud dataproc clusters create --properties spark:spark.executor.cores=1 will make each executor run only one task at a time with the same amount of executor memory, whereas Dataproc normally runs 2 executors per machine and divides the CPUs accordingly. On a 4-core machine, you would normally get 2 executors, each allowed 2 cores; this setting instead gives each of those 2 executors just 1 core, while still using half a machine's memory.
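
As a minimal sketch of the first suggestion (the 100 below is only an illustrative partition count, not a tuned value), spreading the same 10GB across more write tasks keeps each task's slice of the data much smaller:

  // Same write as in the question, but split across more, smaller tasks:
  // with ~10GB of data, 100 partitions works out to roughly 100MB per task
  // instead of ~1GB, which is much easier on per-executor memory.
  toDatabase
    .repartition(100)   // illustrative value; tune it to your data size
    .write
    .json("gs://mypath")

Combined with the third suggestion (spark.executor.cores=1), each executor then only has to hold one of these smaller partitions in memory at a time.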