How to capture the job status in a shell script for spark-submit

Question

I am using the bash shell with spark-sql-2.4.1v. I submit my Spark job with spark-submit from a shell script.

I need to capture the status of my job. How can this be achieved?

Any help/advice please?

Answer 1

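Check the code below. It writes the spark-submit output to a log file and then reads the final status from that log.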

process_start_datetime=$(date +%Y%m%d%H%M%S)
log_path="<log_dir>"
log_file="${log_path}/${app_name}_${process_start_datetime}.log"

spark-submit \
    --verbose \
    --deploy-mode cluster \
    --executor-cores "$executor_cores" \
    --num-executors "$num_executors" \
    --driver-memory "$driver_memory" \
    --executor-memory "$executor_memory"  \
    --master yarn \
    --class main.App "$appJar" 2>&1 | tee -a "$log_file"

# Take the last "final status:" line and trim the surrounding whitespace.
status=$(grep "final status:" < "$log_file" | cut -d ":" -f2 | tail -1 | awk '{$1=$1; print}')

To get the application ID:

applicationId=$(grep "tracking URL" < "$log_file" | head -n 1 | cut -d "/" -f5)
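Once the status has been parsed, the script usually needs to act on it, for example by returning a non-zero exit code to a scheduler. The following is a minimal sketch of that idea, not part of the original answer. The function names final_status_from_log and status_to_exit_code are hypothetical, and the sketch assumes the log contains lines of the form "final status: SUCCEEDED" as printed by the YARN client in cluster mode.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the last "final status:" value found in a log file.
# Prints nothing if the log has no such line.
final_status_from_log() {
    local log_file="$1"
    grep "final status:" "$log_file" | tail -n 1 | cut -d ":" -f2 | awk '{$1=$1; print}'
}

# Hypothetical helper: map a parsed status to a shell exit code.
#   0 = succeeded, 1 = failed or killed, 2 = unknown or missing status
status_to_exit_code() {
    case "$1" in
        SUCCEEDED) return 0 ;;
        FAILED|KILLED) return 1 ;;
        *) return 2 ;;
    esac
}
```

With these in place, the script above could end with status_to_exit_code "$(final_status_from_log "$log_file")" and let the caller inspect $?.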

Answer 2

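spark-submit is an asynchronous job, so once the command has been submitted you can get the application ID by calling SparkContext.applicationId. You can then check the status.

Reference: https://issues.apache.org/jira/browse/SPARK-5439

If Spark is deployed on YARN, you can check the status with: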

# To get the application ID, use: yarn application -list
yarn application -status application_1459542433815_0002
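To capture the status from a script rather than read it by eye, the report can be polled until the application finishes. The sketch below is one possible way to do that and is not from the original answer. The function names yarn_final_state and wait_for_yarn_app are hypothetical, and the parsing assumes the report contains a line of the form "Final-State : SUCCEEDED"; check the output of your YARN version before relying on it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the Final-State field of a YARN application report.
# Assumes the report has a line like "Final-State : SUCCEEDED".
yarn_final_state() {
    local app_id="$1"
    yarn application -status "$app_id" 2>/dev/null \
        | awk -F ' : ' '/Final-State/ {gsub(/[[:space:]]/, "", $2); print $2; exit}'
}

# Hypothetical helper: poll until the application reaches a terminal final state.
# Arguments: application ID, seconds between polls (default 30),
#            maximum number of polls (default 120).
# Returns 0 on SUCCEEDED, 1 on FAILED or KILLED, 2 if no terminal state was seen.
wait_for_yarn_app() {
    local app_id="$1"
    local interval="${2:-30}"
    local max_attempts="${3:-120}"
    local attempt=1
    local state
    while [ "$attempt" -le "$max_attempts" ]; do
        state=$(yarn_final_state "$app_id")
        case "$state" in
            SUCCEEDED) return 0 ;;
            FAILED|KILLED) return 1 ;;
        esac
        sleep "$interval"
        attempt=$((attempt + 1))
    done
    return 2
}
```

For example, wait_for_yarn_app application_1459542433815_0002 60 would poll once a minute and leave the outcome in $?.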

Another approach is mentioned in this answer.


sh

apache-spark

apache-spark-sql

airflow