Error After Slurm Upgrade

We just upgraded our Slurm backend to 17.02.11 and broke the OnDemand front end(s). Interactive Apps gives the error "ERROR: OodCore::JobAdapterError - sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received".

Active Jobs doesn’t give an error but also doesn’t display any results.

This happens on our original production non-RPM-based install as well as on an RPM install (OOD 1.4) on a dev server. Oddly enough, another 1.4 RPM install works just fine. Any ideas?

The Slurm clients on all the submit nodes are still at 15.x, and everything worked before the backend upgrade.


Oddly enough another 1.4 rpm install works just fine

Which 1.4 RPM version failed and which one succeeded?

Before the Slurm upgrade everything worked, but now:

  1. Failed: original production server running OOD 1.2 (non-RPM install).
  2. Failed: dev server running an OOD 1.4 RPM install.
  3. Working: sequestered production server running an OOD 1.4 RPM install.

As for the difference between #2 and #3, maybe it is because #2 is running RHEL 6 while #3 is running RHEL 7. If I can get the RHEL 6 box working, we may very well upgrade #1 and finally move it to an RPM install.

Let me know if this helps.

I can replicate this outside of OOD with -M or --clusters, i.e., any command that specifies a cluster. What file can I edit in OOD to turn off the -M option? This will (hopefully) help me get it back up and running while I look at the larger Slurm issue. I could just be grasping at straws with this idea, but we shall see.

In the cluster config, if you omit cluster: from the job: section, then -M will no longer be used in the commands.
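For anyone looking for the exact spot: OOD cluster definitions live in per-cluster YAML files, and the key in question sits under v2 > job. A minimal sketch (the filename, cluster name, and title here are hypothetical placeholders; adjust to your site):

```yaml
# /etc/ood/config/clusters.d/mycluster.yml  (hypothetical path/name)
v2:
  metadata:
    title: "My Cluster"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    # cluster: "mycluster"   # commented out so OOD stops passing -M/--clusters
```

With cluster: present, OOD appends -M <name> to sbatch/squeue/scancel calls; removing it makes the commands target the local cluster only, which is the behavior wanted on a non-multi-cluster site.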

Thanks tremendously for your help, that was it! Everything is working now.

You are welcome! Is there something I could change to better call attention to this?

I would just specifically say "remove the entry on non-multi-cluster setups, otherwise it may cause errors depending on the versions of OOD and the scheduler". In my case, everything (fortunately) worked this entire time, up until today.

I stumbled upon this in my upgrade from 20.11 to 21.08. In case anyone else stumbles on this, here is what I found:
I ran "sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC"
and the ControlHost showed a loopback address rather than an IP that OOD could use to reach slurmctld.
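A quick way to spot this condition is to check whether the ControlHost column is a loopback address. A small sketch, using simulated sacctmgr output rather than a live query (the cluster name and port below are made-up sample values):

```shell
# Simulated "sacctmgr show cluster" output in parsable (pipe-delimited)
# form; on a real system you would pipe the actual command through awk.
sample='Cluster|ControlHost|ControlPort|RPC
mycluster|127.0.0.1|6817|9472'

# Flag any cluster whose registered controller address is loopback:
printf '%s\n' "$sample" |
  awk -F'|' 'NR > 1 && $2 ~ /^127\./ { print $1 " resolves its controller to loopback: " $2 }'
```

A loopback ControlHost is fine from the controller's own point of view, but any remote host (like the OnDemand server) that uses it will connect to itself instead of slurmctld.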

This led me to realize that I had added an /etc/hosts entry for the slurmctld host during the upgrade, and that entry was being passed out to the OnDemand host. The OnDemand host could be seen faithfully reaching out to localhost, which didn't work.
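The offending pattern is easy to grep for: a loopback line in /etc/hosts that aliases something other than localhost. A sketch with simulated file contents (the hostname is hypothetical; substitute your slurmctld host):

```shell
# Simulated /etc/hosts contents. The second line aliases the slurmctld
# host to loopback, which is harmless on the controller itself but
# breaks any other host (e.g. the OnDemand server) that inherits it.
printf '127.0.0.1 localhost\n127.0.0.1 slurmctld.example.edu\n' |
  awk '$1 == "127.0.0.1" && $2 != "localhost" { print "loopback alias: " $2 }'
```

On a real system, replace the printf with `awk ... /etc/hosts` on each host that talks to the controller; any line this flags should point at the controller's routable IP instead.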

I claim that removing -M does work, but the underlying issue, for me at least, would have revealed itself again if I added another cluster in the future.