Slurmd 在开始时保持 inactive/failed

Slurmd remains inactive/failed on start

我目前有一个由 Slurm 管理的 10 个工作节点和 1 个主节点组成的集群。在遇到一些初期问题之后,我之前已经成功地设置了集群,但设法让它工作了。我将所有脚本和说明放在我的 GitHub 存储库 (https://brettchapman.github.io/Nimbus_Cluster/) 中。我最近需要重新开始增加硬盘space,现在无论我怎么尝试都无法正确安装和配置它

Slurmctld 和 slurmdbd 安装和配置正确(均处于活动状态和 运行 systemctl status 命令),但是 slurmd 仍处于 failed/inactive 状态。

以下是我的slurm.conf文件:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=node-0
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
JobCompHost=localhost
JobCompLoc=slurm_acct_db
JobCompPass=password
#JobCompPort=
JobCompType=jobcomp/mysql
JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node-[1-10] NodeAddr=node-[1-10] CPUs=16 RealMemory=64323 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node-[1-10] Default=YES MaxTime=INFINITE State=UP

下面是我的 slurmdbd.conf 文件:

AuthType=auth/munge
AuthInfo=/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=info
StorageHost=localhost
StorageLoc=slurm_acct_db
StoragePass=password
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm

运行 pdsh -a sudo systemctl status slurmd 在我的计算节点上给我以下错误:

pdsh@node-0: node-5: ssh exited with exit code 3
node-6: ● slurmd.service - Slurm node daemon
node-6:      Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
node-6:      Active: inactive (dead) since Tue 2020-08-11 03:52:58 UTC; 2min 45s ago
node-6:        Docs: man:slurmd(8)
node-6:     Process: 9068 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
node-6:    Main PID: 8983
node-6: 
node-6: Aug 11 03:34:09 node-6 systemd[1]: Starting Slurm node daemon...
node-6: Aug 11 03:34:09 node-6 systemd[1]: slurmd.service: Supervising process 8983 which is not our child. We'll most likely not notice when it exits.
node-6: Aug 11 03:34:09 node-6 systemd[1]: Started Slurm node daemon.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Killing process 8983 (n/a) with signal SIGKILL.
node-6: Aug 11 03:52:58 node-6 systemd[1]: slurmd.service: Succeeded.
pdsh@node-0: node-6: ssh exited with exit code 3

我以前没有收到过这种类型的错误,当我启动集群并 运行ning 时,所以我不确定从现在到上次我做了什么或没有做什么它运行宁。我的猜测是它与 file/folder 权限有关,因为我发现这在设置时可能非常重要。我可能错过了记录我以前做过的事情。这是我第二次尝试设置 slurm 托管集群。

我的整个工作流程和脚本都可以从我的 GitHub 存储库中获取。如果您需要任何其他错误输出,请询问。

感谢您提供的任何帮助。

布雷特

编辑:

查看 node-1 和 运行ning sudo slurmd -Dvvv 之一我得到这个:

slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug:  task/cgroup: now constraining jobs allocated cores
slurmd: debug:  task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug:  task/cgroup: now constraining jobs allocated memory
slurmd: debug:  task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug:  task/cgroup: now constraining jobs allocated devices
slurmd: debug:  task/cgroup: loaded
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: debug:  switch NONE plugin loaded
slurmd: error: Error binding slurm stream socket: Address already in use
slurmd: error: Unable to bind listen port (*:6818): Address already in use

登录到不同的节点 node-10,我得到这个:

slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug:  task/cgroup: now constraining jobs allocated cores
slurmd: debug:  task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug:  task/cgroup: now constraining jobs allocated memory
slurmd: debug:  task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug:  task/cgroup: now constraining jobs allocated devices
slurmd: debug:  task/cgroup: loaded
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: debug:  switch NONE plugin loaded
slurmd: slurmd started on Tue, 11 Aug 2020 06:56:10 +0000
slurmd: CPUs=16 Boards=1 Sockets=1 Cores=8 Threads=2 Memory=64323 TmpDisk=297553 Uptime=756 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  AcctGatherEnergy NONE plugin loaded
slurmd: debug:  AcctGatherProfile NONE plugin loaded
slurmd: debug:  AcctGatherInterconnect NONE plugin loaded
slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.

另一个节点,node-5,我明白了,和node-1一样:

slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug:  task/cgroup: now constraining jobs allocated cores
slurmd: debug:  task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug:  task/cgroup: now constraining jobs allocated memory
slurmd: debug:  task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug:  task/cgroup: now constraining jobs allocated devices
slurmd: debug:  task/cgroup: loaded
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: debug:  switch NONE plugin loaded
slurmd: error: Error binding slurm stream socket: Address already in use
slurmd: error: Unable to bind listen port (*:6818): Address already in use

node-10 之前宕机了,我费了好大劲才把它弄回来,所以这个错误可能与整体问题无关。

Edit2:在所有节点上杀死卡住的 slurmd 进程后,slurmd 仍然无法启动:

slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: failed (Result: timeout) since Tue 2020-08-11 07:10:42 UTC; 3min 58s ago
       Docs: man:slurmd(8)

Aug 11 07:09:11 node-1 systemd[1]: Starting Slurm node daemon...
Aug 11 07:09:11 node-1 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Aug 11 07:10:42 node-1 systemd[1]: slurmd.service: start operation timed out. Terminating.
Aug 11 07:10:42 node-1 systemd[1]: slurmd.service: Failed with result 'timeout'.
Aug 11 07:10:42 node-1 systemd[1]: Failed to start Slurm node daemon.

node1 上的 sudo slurmd -Dvvv 输出:

slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:16 Boards:1 Sockets:1 CoresPerSocket:8 ThreadsPerCore:2
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug:  task/cgroup: now constraining jobs allocated cores
slurmd: debug:  task/cgroup/memory: total:64323M allowed:100%(enforced), swap:0%(permissive), max:100%(64323M) max+swap:100%(128646M) min:30M kmem:100%(64323M permissive) min:30M swappiness:0(unset)
slurmd: debug:  task/cgroup: now constraining jobs allocated memory
slurmd: debug:  task/cgroup: unable to open /etc/slurm-llnl/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug:  task/cgroup: now constraining jobs allocated devices
slurmd: debug:  task/cgroup: loaded
slurmd: debug:  Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
slurmd: Munge credential signature plugin loaded
slurmd: slurmd version 19.05.5 started
slurmd: debug:  Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: debug:  switch NONE plugin loaded
slurmd: slurmd started on Tue, 11 Aug 2020 07:14:08 +0000
slurmd: CPUs=16 Boards=1 Sockets=1 Cores=8 Threads=2 Memory=64323 TmpDisk=297553 Uptime=15897 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  AcctGatherEnergy NONE plugin loaded
slurmd: debug:  AcctGatherProfile NONE plugin loaded
slurmd: debug:  AcctGatherInterconnect NONE plugin loaded
slurmd: debug:  AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.

Edit3:我从 slurmd.log 文件中得到这些调试消息,这似乎表明无法检索 PID 并且某些 files/folders 不可访问:

[2020-08-11T07:38:27.973] slurmd version 19.05.5 started
[2020-08-11T07:38:27.973] debug:  Job accounting gather NOT_INVOKED plugin loaded
[2020-08-11T07:38:27.973] debug:  job_container none plugin loaded
[2020-08-11T07:38:27.973] debug:  switch NONE plugin loaded
[2020-08-11T07:38:27.973] slurmd started on Tue, 11 Aug 2020 07:38:27 +0000
[2020-08-11T07:38:27.973] CPUs=16 Boards=1 Sockets=1 Cores=8 Threads=2 Memory=64323 TmpDisk=297553 Uptime=17357 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2020-08-11T07:38:27.973] debug:  AcctGatherEnergy NONE plugin loaded
[2020-08-11T07:38:27.973] debug:  AcctGatherProfile NONE plugin loaded
[2020-08-11T07:38:27.974] debug:  AcctGatherInterconnect NONE plugin loaded
[2020-08-11T07:38:27.974] debug:  AcctGatherFilesystem NONE plugin loaded
[2020-08-11T07:38:27.974] debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
[2020-08-11T07:38:27.975] debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2020-08-11T07:38:33.496] got shutdown request
[2020-08-11T07:38:33.496] all threads complete
[2020-08-11T07:38:33.496] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2020-08-11T07:38:33.496] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2020-08-11T07:38:33.496] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2020-08-11T07:38:33.496] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2020-08-11T07:38:33.497] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
[2020-08-11T07:38:33.497] debug2: xcgroup_get_pids: unable to get pids of '(null)'
[2020-08-11T07:38:33.497] Consumable Resources (CR) Node Selection plugin shutting down ...
[2020-08-11T07:38:33.497] Munge credential signature plugin unloaded
[2020-08-11T07:38:33.497] Slurmd shutdown completing

Edit4:slurmd 处于活动状态,但仅在 运行ning sudo 服务 slurmd 重启之后。 运行 停止然后启动不会激活 slurmd。

● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2020-08-11 08:17:46 UTC; 1min 37s ago
       Docs: man:slurmd(8)
    Process: 28281 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 28474
      Tasks: 0
     Memory: 1.1M
     CGroup: /system.slice/slurmd.service

Aug 11 08:17:46 node-1 systemd[1]: Starting Slurm node daemon...
Aug 11 08:17:46 node-1 systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Aug 11 08:17:46 node-1 systemd[1]: Started Slurm node daemon.
Aug 11 08:18:41 node-1 systemd[1]: slurmd.service: Supervising process 28474 which is not our child. We'll most likely not notice when it exits.

Edit5:另一个可能相关的问题是 sacct 只能 运行 与 sudo,它抱怨日志文件的权限。我尝试将权限更改为 /var/log 但它导致了问题,因为它是一个系统文件夹:

ubuntu@node-0:/data/pangenome_cactus$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
/var/log/slurm_jobacct.log: Permission denied
ubuntu@node-0:/data/pangenome_cactus$ sudo sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
2            cactus_pa+      debug     (null)          0     FAILED    127:0 
3            cactus_pa+      debug     (null)          0    RUNNING      0:0 
3.0          singulari+                (null)          0    RUNNING      0:0 

slurmd 守护进程说 got shutdown request,所以它被 systemd 终止,可能是因为 Can't open PID file /run/slurmd.pid (yet?) after startsystemd配置为如果PID文件/run/slurmd.pid存在则认为slurmd启动成功。但是 Slurm 配置声明 SlurmdPidFile=/var/run/slurmd.pid。尝试将其更改为 SlurmdPidFile=/run/slurmd.pid.

由于我认为我的解决方案有效,因此我将对这个问题添加一些意见。 “Slurmd”似乎并不关心 PidFile 路径是否存在。但是,当 运行 宁作为守护程序时,如果无法写入给定路径,它将 return 错误代码。 Linux 服务捕获错误代码并认为守护进程未能启动,但实际上“slurmd”已经启动。这就是为什么您在尝试再次启动时会收到“地址已在使用”错误的原因。所以,解决这个问题的方法是确保机器启动时 PidFile 路径确实存在。

#解决方案 #1

不要在 /var/run 下创建文件。使用其他一些不是“root”的目录。如果您想使用 /var/run,请转到解决方案 #2。

#解决方案 #2

/var/run是在内存中建立的临时目录。它不会在重启之间持续存在。另一个问题是“/var/run”是针对“root”用户而不是“slurm”的。这就是为什么“slurmd”无权写入它的原因。所以我建议创建 /var/run/slurm 并将所有内容放在那里。

解决这个问题,可以参考《芒格》。如果你执行“ls -l /var/run/”你会注意到“/var/run/munge”有用户“munge”和组“munge”。此外,munge 能够在启动时创建“/var/run/munge”目录。

要在启动时在“/var/run”下创建目录,只需在/usr/lib/tmpfiles.d/slurm.conf 下创建一个文件(这也是 munge 的做法。您可以参考/usr/lib/tmpfiles.d/munge.conf).

d /var/run/slurm 0755 slurm slurm -
d /var/log/slurm 0755 slurm slurm -
d /var/spool/slurm 0755 slurm slurm -

然后,确保您的 slurm.conf、slurmd.service、slurmctld.service 的 PidFile 指向与上述相同的位置。

就是这样。它应该可以解决问题。我还 运行 遇到另一个奇怪的问题,服务在启动时会失败,我不得不在我的服务中添加“Restart=on-failure”和“RestartSec=5”,这样它最终会成功(大约 10~20s ).这不是很整洁,但确实有效。