Error After Slurm Upgrade

We just upgraded our Slurm backend to 17.02.11 and broke the OnDemand front end(s). Interactive Apps gives the error "ERROR: OodCore::JobAdapterError - sbatch: error: slurm_receive_msg: Zero Bytes were transmitted or received".

Active Jobs doesn’t give an error but also doesn’t display any results.

This happens on our original production non-RPM-based install as well as on an RPM install (OOD 1.4) on a dev server. Oddly enough, another 1.4 RPM install works just fine. Any ideas?

The Slurm clients on all the submit nodes are still at 15.x, and everything worked before the backend upgrade.


Oddly enough another 1.4 rpm install works just fine

Which 1.4 RPM version failed and which one succeeded?

Before the Slurm upgrade everything worked, but now:

  1. Failed: original production server running OOD 1.2 (non-RPM install).
  2. Failed: dev server running an OOD 1.4 RPM install.
  3. Working: sequestered production server running an OOD 1.4 RPM install.

As for the difference between #2 and #3, maybe it is because #2 is running RHEL 6 while #3 is running RHEL 7. If I can get the RHEL 6 box working, we may very well upgrade #1 and finally move it to an RPM install.

Let me know if this helps.

I can replicate this outside of OOD with -M or --clusters, i.e., any command that specifies a cluster. What file can I edit in OOD to turn off the -M option? This will (hopefully) help me get it back up and running while I look at the larger Slurm issue. I could just be grasping at straws with this idea, but we shall see.

In the cluster config, if you omit cluster: from the job: section, then -M will no longer be used in the commands.
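For anyone looking for the exact spot: OOD cluster definitions live in per-cluster YAML files, and the key in question sits under v2 > job. A minimal sketch (the filename, cluster name, and title here are hypothetical placeholders; adjust to your site):

```yaml
# /etc/ood/config/clusters.d/mycluster.yml  (hypothetical path/name)
v2:
  metadata:
    title: "My Cluster"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    # cluster: "mycluster"   # commented out so OOD stops passing -M/--clusters
```

With cluster: present, OOD appends -M <name> to sbatch/squeue/scancel calls; removing it makes the commands target the local cluster only, which is the behavior wanted on a non-multi-cluster site.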

Thanks tremendously for your help, that was it! Everything is working now.

You are welcome! Is there something I could change to better call attention to this?

I would just specifically say "remove the entry on non-multi-cluster setups, otherwise it may cause errors depending on the versions of OOD and the scheduler". In my case, everything (fortunately) worked this entire time, up until today.

I stumbled upon this in my upgrade from 20.11 to 21.08. In case anyone else stumbles on this, here is what I found:
I ran "sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC"
and the ControlHost showed a loopback address rather than an IP that OOD could use to reach slurmctld.
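A quick way to spot this condition is to check whether the ControlHost column is a loopback address. A small sketch, using simulated sacctmgr output rather than a live query (the cluster name and port below are made-up sample values):

```shell
# Simulated "sacctmgr show cluster" output in parsable (pipe-delimited)
# form; on a real system you would pipe the actual command through awk.
sample='Cluster|ControlHost|ControlPort|RPC
mycluster|127.0.0.1|6817|9472'

# Flag any cluster whose registered controller address is loopback:
printf '%s\n' "$sample" |
  awk -F'|' 'NR > 1 && $2 ~ /^127\./ { print $1 " resolves its controller to loopback: " $2 }'
```

A loopback ControlHost is fine from the controller's own point of view, but any remote host (like the OnDemand server) that uses it will connect to itself instead of slurmctld.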

This led me to realize that I had added an /etc/hosts entry for the slurmctld host during the upgrade, and that entry was being passed out to the OnDemand host. The OnDemand host could be seen faithfully reaching out to localhost, which didn't work.
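The offending pattern is easy to grep for: a loopback line in /etc/hosts that aliases something other than localhost. A sketch with simulated file contents (the hostname is hypothetical; substitute your slurmctld host):

```shell
# Simulated /etc/hosts contents. The second line aliases the slurmctld
# host to loopback, which is harmless on the controller itself but
# breaks any other host (e.g. the OnDemand server) that inherits it.
printf '127.0.0.1 localhost\n127.0.0.1 slurmctld.example.edu\n' |
  awk '$1 == "127.0.0.1" && $2 != "localhost" { print "loopback alias: " $2 }'
```

On a real system, replace the printf with `awk ... /etc/hosts` on each host that talks to the controller; any line this flags should point at the controller's routable IP instead.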

I claim that removing -M does work, but the underlying issue, for me at least, would have revealed itself again if I added another cluster in the future.