Spark Error: Executor XXX finished with state EXITED message Command exited with code 1 exitStatus 1
I am building a standalone Spark cluster on Oracle Linux. I added this line to spark-env.sh on the Master:
export SPARK_MASTER_HOST=x.x.x.x
and added these lines to spark-env.sh on both the Master and the Worker:
export PYSPARK_PYTHON=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8
I also put the worker's IP in the workers file on both the Master and the Worker (a sketch of that file is shown after the startup commands below). I start the Spark cluster like this:
On the Master:
/opt/spark/sbin/start-master.sh
On the Worker:
/opt/spark/sbin/start-worker.sh spark://x.x.x.x:7077
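For reference, the workers file mentioned above is a plain-text list with one worker address per line; assuming a default Spark layout it is conf/workers (named conf/slaves in releases before Spark 3.1). A minimal sketch for this one-worker cluster:
x.x.x.x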
In fact, I have one Worker and one Master. I configured ~/.bashrc like this:
export JAVA_HOME=/opt/oracle/java/jdk1.8.0_25
export PATH=$JAVA_HOME/bin:$PATH
alias python=/usr/bin/python3.8
export LD_LIBRARY_PATH=/opt/oracle/instantclient_21_4:$LD_LIBRARY_PATH
export PATH=/opt/oracle/instantclient_21_4:$PATH
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_HOME=/usr/bin/python3.8
export PYSPARK_DRIVER_PYTHON=python3.8
export PYSPARK_PYTHON=/usr/bin/python3.8
Although spark-submit runs without errors, the command runs forever and never produces any result: executors keep exiting with code 1 and being relaunched. I see these lines:
22/03/04 12:07:40 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks resource profile 0
22/03/04 12:07:41 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220304120738-0000/0 is now EXITED (Command exited with code 1)
22/03/04 12:07:41 INFO StandaloneSchedulerBackend: Executor app-20220304120738-0000/0 removed: Command exited with code 1
22/03/04 12:07:41 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220304120738-0000/3 on worker-20220304120443-192.9.200.68-42185 (192.9.200.68:42185) with 2 core(s)
22/03/04 12:07:41 INFO StandaloneSchedulerBackend: Granted executor ID app-20220304120738-0000/3 on hostPort 192.9.200.68:42185 with 2 core(s), 2.0 GiB RAM
I checked the worker log, and I have this error:
22/03/04 12:07:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with m$
22/03/04 12:07:38 INFO ExecutorRunner: Launch command: "/opt/oracle/java/jdk1.8.0_25/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx2048M" "-Dspark.driver.port=40345" "-XX:+PrintGC$
22/03/04 12:07:38 INFO ExecutorRunner: Launch command: "/opt/oracle/java/jdk1.8.0_25/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx2048M" "-Dspark.driver.port=40345" "-XX:+PrintGC$
22/03/04 12:07:38 INFO ExecutorRunner: Launch command: "/opt/oracle/java/jdk1.8.0_25/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx2048M" "-Dspark.driver.port=40345" "-XX:+PrintGC$
22/03/04 12:07:41 INFO Worker: Executor app-20220304120738-0000/0 finished with state EXITED message Command exited with code 1 exitStatus 1
22/03/04 12:07:41 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 0
22/03/04 12:07:41 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220304120738-0000, execId=0)
The spark-submit command looks like this:
/opt/spark/bin/spark-submit --master spark://x.x.x.x:7077 --files etl/sparkConfig.json --py-files etl/brn_utils.py,etl/cst.py,etl/cst_utils.py,etl/emp_utils.py,etl/general_utils.py,etl/grouping.py,etl/grp_state.py,etl/conn.py etl/main.py
I tested as the root user, and I also created a spark user, but nothing changed.
Can you guide me on where I went wrong?
Thank you.
The problem is solved.
I think it was a network problem: the executors were exiting because they could not connect back to the driver, and spark.driver.host tells them which address to use. Since I added this option to spark-submit, everything works:
--conf spark.driver.host=x.x.x.x
In fact, I run this:
/opt/spark/bin/spark-submit --master spark://x.x.x.x:7077 --conf spark.driver.host=x.x.x.x --files etl/sparkConfig.json --py-files etl/brn_utils.py,etl/cst.py,etl/cst_utils.py,etl/emp_utils.py,etl/general_utils.py,etl/grouping.py,etl/grp_state.py,etl/conn.py etl/main.py
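The same setting can also be applied inside the application instead of on the command line. A minimal sketch, assuming etl/main.py builds its own SparkSession with PySpark:
from pyspark.sql import SparkSession

# x.x.x.x must be an address of the driver machine that the workers can reach
spark = (
    SparkSession.builder
    .master("spark://x.x.x.x:7077")
    .config("spark.driver.host", "x.x.x.x")
    .getOrCreate()
)
Alternatively, a line like spark.driver.host x.x.x.x in conf/spark-defaults.conf applies it to every submission.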
Be careful to copy your program to the same location on all nodes.
Also, because I access the cluster remotely, I use an SSH tunnel to get the UI on my own computer, like this:
ssh spark@master_ip -N -L 4040:master_ip:8080
In the command above, 4040 is the port on my computer and 8080 is the port on the master host. After creating the SSH tunnel, I can open localhost:4040 in my browser to view the Spark master UI.
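Note that this tunnel only exposes the master UI. A running application's own UI is served by the driver on port 4040, so, assuming the driver runs on the master node, a second tunnel along the same lines would expose it:
ssh spark@master_ip -N -L 4041:master_ip:4040
Then the application UI would be at localhost:4041.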
I hope this helps.