Version: 3.1.14 and 3.1.15
After 4+ years of working flawlessly, OnDemand has randomly started failing to submit jobs with the error “Sbatch error ‘medicinebow’ can’t be reached now or it is an invalid entry for --cluster. Use sacctmgr …”
We discovered that restarting the slurmctld daemon on the management node corrects the issue and allows jobs to run.
Testing Slurm commands from the CLI indicates that Slurm is up and working as expected, i.e. all Slurm commands work. We tried rebooting the OOD VM, but the issue continues after the reboot.
There are no errors in the controller logs to indicate that there is or was a Slurm fault.
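One check that may be worth capturing while the failure is happening: the error refers to --cluster, so the failing submission seems to go through Slurm's multi-cluster lookup, which CLI commands run without a cluster option may not exercise. Below is a minimal sketch, assuming sacctmgr is on the PATH of the host where it runs, and assuming that an empty ControlHost in its output means the controller is not currently registered with the accounting database (a guess, not something confirmed here). The function names and the sample input in the usage note are hypothetical.

```python
import subprocess


def unregistered_clusters(sacctmgr_output: str) -> list[str]:
    """Return cluster names whose ControlHost field is empty.

    Expects the pipe-delimited, header-less output of:
        sacctmgr -n -P show cluster format=Cluster,ControlHost,ControlPort
    """
    missing = []
    for line in sacctmgr_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("|")
        name = fields[0]
        control_host = fields[1] if len(fields) > 1 else ""
        if not control_host.strip():
            missing.append(name)
    return missing


def check_live() -> list[str]:
    """Run sacctmgr on this host and report clusters with no ControlHost."""
    result = subprocess.run(
        ["sacctmgr", "-n", "-P", "show", "cluster",
         "format=Cluster,ControlHost,ControlPort"],
        capture_output=True,
        text=True,
        check=True,
    )
    return unregistered_clusters(result.stdout)
```

For example, `unregistered_clusters("medicinebow||0\n")` would flag `medicinebow`. Running `check_live()` on the OOD VM both during a failure and after restarting slurmctld would show whether that field changes.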
We have been trying to figure out the cause of this issue to no avail.
Any ideas on what to look at when we see this issue? We're at a loss.