How to pass the SLURM-jobID as an input argument to python?
I just started using SLURM to train a batch of convolutional neural networks. To easily keep track of all the trained CNNs, I want to pass the SLURM jobID as an input argument to python. Passing other variables as arguments works fine; however, I cannot access the SLURM job ID in order to pass it.
I have already tried ${SLURM_JOBID}, ${SLURM_JOB_ID}, %j and %J. I also tried writing these Slurm env variables into a shell variable first and then passing that to python.
Here is my latest code:
#!/bin/bash
# --- info to user
echo "script started ... "
# --- setup environment
module purge # clean up
module load python/3.6
module load nvidia/10.0
module load cudnn/10.0-v7
# --- display information
HOST=`hostname`
echo "This script runs the CNN. Slurm scheduled it on node $HOST"
echo "I am interested of all environment variables Slurm adds:"
env | grep -i slurm
# --- start running ...
echo " --- run --- "
# --- define some variables
dc="dice"
sm="softmax"
# --- run a job using a slurm batch script
for layer in {3..15..2}
do
sbatch -N 1 -n 1 --mem=20G --mail-type=END --gres=gpu:V100:3 --wrap="singularity --noslurm tensorflow_19.03-py3.simg python run_CNN_dynlayer.py ${SLURM_JOBID} ${layer} ${dc}"
sleep 1 # pause 1s to be kind to the scheduler...
echo "jobid: "+${SLURM_JOBID}
echo " --- next --- "
done
The command-line output looks like this:
femonk@rarp1 [CNN] ./run_CNN_test.slurm
script started ...
This script runs the CNN. Slurm scheduled it on node rarp1
I am interested of all environment variables Slurm adds:
SLURM_ACCOUNT=AI
PYTHONPATH=/cluster/slurm/lib64/python3.6/site-packages:/cluster/slurm/lib64/python3.6/site-packages:/cluster/slurm/lib64/python3.6/site-packages:
--- run ---
Submitted batch job 3182711
jobid:
--- next ---
femonk@rarp1 [CNN]
Does anyone know what is wrong with my code?
Thanks in advance.
The SLURM_JOBID environment variable is only available to the processes of the job itself, not to the process that submits the job. The job ID is returned by the sbatch command, so if you want it in a variable, you need to assign it yourself:
for layer in {3..15..2}
do
SLURM_JOBID=$(sbatch --parsable -N 1 -n 1 --mem=20G --mail-type=END --gres=gpu:V100:3 --wrap="singularity --noslurm tensorflow_19.03-py3.simg python run_CNN_dynlayer.py ${SLURM_JOBID} ${layer} ${dc}")
sleep 1 # pause 1s to be kind to the scheduler...
echo "jobid: "+${SLURM_JOBID}
echo " --- next --- "
done
Note the command substitution $() used together with the --parsable argument of sbatch.
Also note that the line Submitted batch job 3182711 currently printed will disappear, since that output is now consumed to populate the SLURM_JOBID variable.
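One subtlety with the double-quoted --wrap string above: an unescaped ${SLURM_JOBID} is expanded by the submitting shell before sbatch even runs, so the wrapped command sees the submitter's (stale or empty) value rather than the new job's own ID. Escaping it as \${SLURM_JOB_ID} passes the literal text into the job, where Slurm's own environment expands it. A minimal sketch of the two expansion times, using a plain bash -c child as a stand-in for the shell Slurm starts inside the job (the variable names and values here are made up for illustration):

```shell
#!/bin/bash
# The submitting shell's value: substituted into the double-quoted
# string before the child shell ever runs.
SLURM_JOB_ID=1111

# The value the child ("job") shell will find in its environment.
export CHILD_JOB_ID=2222

# Unescaped: the submitting shell substitutes 1111 into the string.
submit_time=$(bash -c "echo ${SLURM_JOB_ID}")

# Escaped: the child receives the literal text ${CHILD_JOB_ID} and
# expands it from its own environment, yielding 2222.
job_time=$(bash -c "echo \${CHILD_JOB_ID}")

echo "submit-time expansion: ${submit_time}"
echo "job-time expansion:    ${job_time}"
```

With this in mind, writing the wrapped command as python run_CNN_dynlayer.py \${SLURM_JOB_ID} ... would let each job hand its own ID to python, independent of what the submitting shell knows.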