Open OnDemand job submission fails with cluster unavailable error

Version: 3.1.14 and 3.1.15

After 4+ years of working flawlessly, OnDemand suddenly started failing to submit jobs with the error “Sbatch error ‘medicinebow’ can’t be reached now or it is an invalid entry for --cluster. Use sacctmgr …”

We discovered that restarting the slurmctld daemon on the management node corrects the issue and allows jobs to run.

Testing Slurm commands from the CLI indicates that Slurm is up and working as expected, i.e. all Slurm commands work. We tried rebooting the OOD VM, but the issue persists after the reboot.

There are no errors in the controller logs to indicate that there is or was a Slurm fault.

We have been trying to figure out the cause of this issue to no avail.

Any ideas on what to look at when we see this issue? We're at a loss.

The place to look will be the logs. We have a page to go over all that here: Logging — Open OnDemand 4.0.0 documentation

The session logs seem like the best bet for seeing what’s up. It sounds like it no longer recognizes the cluster, and I don’t know why that would simply stop working. Were there really no changes to the system at all? Updates of some kind, anything?
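
Alongside the logs, the error text itself points at sacctmgr, so it may be worth capturing what the accounting database reports for the cluster at the moment the failure occurs. The Python sketch below is only an illustration and not something taken from this thread: it assumes the OOD host can run sacctmgr, and that an unregistered cluster shows up with an empty ControlHost and a ControlPort of 0. The helper names (parse_clusters, is_registered) are hypothetical, and the cluster name is the one from the error message.

```python
import subprocess

# Assumed diagnostic command: list clusters known to the accounting database,
# without a header (-n) and pipe-delimited (-P).
SACCTMGR_CMD = [
    "sacctmgr", "-n", "-P", "show", "cluster",
    "format=Cluster,ControlHost,ControlPort",
]


def parse_clusters(text):
    """Turn 'name|host|port' lines into a dict keyed by cluster name."""
    clusters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) < 3:
            continue  # skip anything that is not a three-column row
        name, host, port = fields[:3]
        clusters[name] = {"control_host": host, "control_port": port}
    return clusters


def is_registered(clusters, name):
    """Assumption: a registered cluster has a non-empty host and non-zero port."""
    entry = clusters.get(name)
    if entry is None:
        return False
    return bool(entry["control_host"]) and entry["control_port"] not in ("", "0")


if __name__ == "__main__":
    result = subprocess.run(SACCTMGR_CMD, capture_output=True, text=True, check=True)
    clusters = parse_clusters(result.stdout)
    name = "medicinebow"  # cluster name from the error message
    state = "registered" if is_registered(clusters, name) else "NOT registered"
    print(f"{name}: {state} -> {clusters.get(name)}")
```

Running this from the OOD VM while submissions are failing, and again after restarting slurmctld, would show whether the cluster's registration in the accounting database changes between the two states. If it does not, the session logs are still the better lead.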