Run multiple jobs at a time per node through SLURM
I have a cluster with 3 nodes, each with 110GB of RAM and 16 cores. I want to keep submitting jobs to a node as long as the requested memory is available.
I am using this bash script, called test_slurm.sh:
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=10G
python test.py
So if I have 33 jobs that each request 10GB, and I have 3 nodes with 110GB of RAM each, I would like to be able to run all 33 at once if possible, instead of only 3 at a time as my current setup allows.
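For illustration only (the original question does not show how the 33 jobs are launched), they could be queued with a simple loop over sbatch:
# Hypothetical submission loop: queue 33 copies of the same job script.
for i in $(seq 1 33); do
    sbatch test_slurm.sh   # each job asks for 1 CPU and 10G of memory
done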
Here is what squeue looks like:
So only three jobs run at a time, even though I have enough memory for more.
sinfo -o "%all" returns:
AVAIL|CPUS|TMP_DISK|FEATURES|GROUPS|SHARE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIORITY|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |SOCKETS |CORES |THREADS
up|16|0|(null)|all|NO|infinite|115328|parrot101|parrot101|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot101 |0.01 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
up|16|0|(null)|all|NO|infinite|115328|parrot102|parrot102|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot102 |0.14 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
up|16|0|(null)|all|NO|infinite|115328|parrot103|parrot103|1|no|1-infinite|alloc|Unknown|14.03|1|16:1:1|1/0 |UNLIMITED |16/0/0/16 |1 |none |1/0/0/1 |(null) |Unknown |n/a |OFF |parrot103 |0.26 |myNodes* |myNodes |all |allocated |Unknown |16 |1 |1
squeue -o "%all" returns:
ACCOUNT|GRES|MIN_CPUS|MIN_TMP_DISK|END_TIME|FEATURES|GROUP|SHARED|JOBID|NAME|COMMENT|TIMELIMIT|MIN_MEMORY|REQ_NODES|COMMAND|PRIORITY|QOS|REASON||ST|USER|RESERVATION|WCKEY|EXC_NODES|NICE|S:C:T|JOBID |EXEC_HOST |CPUS |NODES |DEPENDENCY |ARRAY_JOB_ID |GROUP |SOCKETS_PER_NODE |CORES_PER_SOCKET |THREADS_PER_CORE |ARRAY_TASK_ID |TIME_LEFT |TIME |NODELIST |CONTIGUOUS |PARTITION |PRIORITY |NODELIST(REASON) |START_TIME |STATE |USER |SUBMIT_TIME |LICENSES |CORE_SPECWORK_DIR
(null)|(null)|1|0|N/A|(null)|j1101|no|26609|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 26|0.99998411652632|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26609 |n/a |1 |1 | |26609 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899076 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26610|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 27|0.99998411629349|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26610 |n/a |1 |1 | |26610 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899075 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26611|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 28|0.99998411606066|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26611 |n/a |1 |1 | |26611 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899074 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26612|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 29|0.99998411582782|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26612 |n/a |1 |1 | |26612 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899073 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
(null)|(null)|1|0|N/A|(null)|j1101|no|26613|slurm_py_submit.sh|(null)|UNLIMITED|40K||/att/gpfsfs/home/spotter5/python/slurm_py_submit.sh 1 rcp85 30|0.99998411559499|(null)|Resources||PD|spotter5|(null)|(null)||0|*:*:*|26613 |n/a |1 |1 | |26613 |61101 |* |* |* |N/A |UNLIMITED |0:00 | |0 |myNodes |4294899072 |(Resources) |2019-03-19T13:03:57 |PENDING |474609391 |2018-03-19T11:57:39 |(null) |0/att/gpfsfs/home/spotter5/python
Based on the output of your sinfo -o "%all", I can tell why your jobs are not getting through.
If you look at the CPUS(A/I/O/T) field, it reads 16/0/0/16 for all nodes:
A (allocated): 16
I (idle, available for jobs): 0
O (other): 0
T (total): 16
That is, the CPUs are what is keeping your jobs from starting, not memory as you expected. All CPUs appear to be allocated to (other) jobs.
Now, as for why... we do not have enough information at this point. The output of squeue -o "%all" will provide more insight.
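As a purely hypothetical follow-up sketch (not part of the original answer), standard sinfo/squeue/scontrol options give a more compact view of what is consuming the CPUs and memory:
# Per-node CPU state (allocated/idle/other/total), total and free memory
sinfo -N -o "%n %C %m %e"
# Per-job: id, user, state, CPU count, requested memory, reason/nodelist
squeue -o "%i %u %t %C %m %R"
# Full detail for a single node, including CPUAlloc and AllocMem
scontrol show node parrot101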