In my prod cluster file I have removed the “cluster” option from the job section, since I was getting the error below:
sbatch: error: No cluster 'reshpcusw2cprd' known by database.
sbatch: error: 'reshpcusw2cprd' can't be reached now, or it is an invalid entry for --cluster. Use 'sacctmgr list clusters' to see available clusters.
After removing the “cluster” option I am able to submit an interactive session job, but it's going to the “Queued” state.
That’s all OnDemand can do. If your job is being queued, then the communication to Slurm is OK. If it stays queued, that’s an issue with Slurm and/or the job request itself.
Being queued could be just fine. It could be that the system simply has to wait to fulfill your request. On the other hand, you could be requesting something that Slurm cannot fulfill at all, and it will wait forever for a resource to become available when that resource in fact does not even exist.
I’d ask what the parameters of the interactive job you’re trying to submit are, i.e., which queue, how many CPUs, how much memory, and so on. Then, once you have those, I’d ask whether your system can actually ever fulfill that request (i.e., does such a machine even exist?).
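A quick way to see how Slurm itself explains the pending state (the job ID below is just a placeholder):

```
# Ask Slurm why your jobs are still pending (the last column is the reason)
squeue -u $USER -o "%.10i %.12P %.8T %r"

# Or look at one job directly; JobState and Reason appear on the same line
scontrol show job 12345 | grep -i reason
```

If the reason is something like Resources or Priority, it may just be waiting; if it never changes, check whether the requested partition, CPU count, or memory can ever be satisfied.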
That’s definitely the issue (and a bit of a gap in our documentation). You need to set up host-based SSH authentication between your OnDemand node and the production servers. There are a few threads here on Discourse about that if you search for it.
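As a rough sketch of what that involves (hostnames are placeholders and the exact files can vary by distro), host-based authentication usually looks something like:

```
# On each production login/submit host, in /etc/ssh/sshd_config:
HostbasedAuthentication yes

# ...and add the OnDemand host to /etc/ssh/shosts.equiv:
ondemand.example.edu

# ...and add the OnDemand host's public host key to /etc/ssh/ssh_known_hosts.

# On the OnDemand host, in /etc/ssh/ssh_config (or a Host block for the prod servers):
HostbasedAuthentication yes
EnableSSHKeysign yes
```

The threads on Discourse walk through this in more detail.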
The standard procedure in our environment is to use password-based authentication for individual users. As an alternate approach, is it possible to connect the PROD cluster to OOD like we have done with DEV?
There are two things at play here (possibly three, the third being that you have a development Slurm cluster; I think that complicates things, as we at OSC don’t have a development Slurm cluster, only production clusters).
One is how you’re communicating with Slurm. To communicate with Slurm on the same machine, simply set that machine up as you would a login host. That is, it has the Slurm binaries on it, MUNGE, and all the configuration you need to communicate with Slurm. Testing this does not require OnDemand, so don’t move ahead until you can interact with that cluster through the command line on that same server. OnDemand issues commands just like you do on the command line, so if it works for you from the command line, it’ll work for OnDemand.
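For example, on the OnDemand host itself you should be able to run something like this successfully before touching OnDemand at all:

```
sinfo                       # can this host see the cluster's partitions and nodes?
squeue -u $USER             # can it query the queue?
sbatch --wrap="hostname"    # can it actually submit a trivial job?
```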
The second issue is multi-cluster support. Here you need a different cluster.d file for every cluster, so you’d have /etc/ood/config/clusters.d/first_cluster.yml and /etc/ood/config/clusters.d/second_cluster.yml.
Here’s an example I pulled from our system. So you’d have first_cluster.yml point to a .conf file that corresponds to that cluster.
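A minimal sketch of what such a file could look like (the title, host, and paths are placeholders for your own values):

```yaml
---
v2:
  metadata:
    title: "First Cluster"
  login:
    host: "first-cluster-login.example.edu"
  job:
    adapter: "slurm"
    cluster: "first_cluster"
    bin: "/usr/bin"
    # each cluster's yml points at that cluster's own slurm.conf
    conf: "/etc/slurm/first_cluster/slurm.conf"
```

second_cluster.yml would look the same, just with its own title, cluster name, and conf path.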
OnDemand will then see these as two separate and distinct clusters.
Again, remove OnDemand from the question and ask: how can I get this machine to submit to two Slurm clusters? When it works from the CLI, it’ll work for OnDemand. In your cluster.d files, supply the .conf location for each cluster (there should be one for each Slurm cluster on the machine), as in the example above.
Get it to work from the CLI first. Then you can get it to work from OnDemand.
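If each cluster has its own slurm.conf on that machine, one way to sanity-check this from the CLI (the paths here are placeholders) is to point SLURM_CONF at each one in turn, which is essentially the mechanism the conf: setting in cluster.d relies on:

```
SLURM_CONF=/etc/slurm/first_cluster/slurm.conf  sinfo
SLURM_CONF=/etc/slurm/second_cluster/slurm.conf sinfo
```

Once both of those return the right partitions and nodes, the corresponding cluster.d files should behave the same way from OnDemand.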