Need help with multi-cluster setup

I have two cluster config files saved in /etc/ood/config/clusters.d/, one for the dev cluster and one for the prod cluster.

hpc-dev-cluster.yml 
---
v2:
  metadata:
    title: "HPC Cluster"
    url: "https://localhost"
  login:
    host: "localhost"
    user: "%{user}"
    default: true
    auth: "ssh"
  job:
    adapter: "slurm"
    cluster: "reshpcusw2cdev"
    bin: "/usr/bin"
    strict_host_checking: false
hpc-prod-cluster.yml 
---
v2:
  metadata:
    title: "HPC Cluster PROD"
    url: "https://grhpc2.na.prod.com"
  login:
    host: "grhpc2.na.prod.com"
    user: "%{user}"
    default: true
    auth: "ssh"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    strict_host_checking: false

In my prod cluster file I removed the “cluster” option from the job section, since I was getting the error below:

sbatch: error: No cluster 'reshpcusw2cprd' known by database.
sbatch: error: 'reshpcusw2cprd' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.

After removing the “cluster” option, I am able to submit an interactive session job, but it goes into the “Queued” state.
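
The “cluster” option maps to sbatch’s -M/--clusters flag, so the same failure should be reproducible from a shell on the OOD server (a sketch; the --wrap command is just a placeholder):

sbatch --clusters=reshpcusw2cprd --wrap="hostname"   # fails: name unknown to the local slurmdbd
sbatch --wrap="hostname"                             # submits to whatever cluster the local slurm.conf points at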

That’s about all OnDemand can do. If your job reaches the “Queued” state, then the communication with Slurm is OK; why it stays queued is a question for Slurm and/or the job request itself.

Being “Queued” could be just fine: the system may simply have to wait to fulfill your request. On the other hand, you could be requesting something that Slurm cannot fulfill at all, and it will wait forever for a resource to become available when that resource does not in fact exist.

I’d ask: what are the parameters of the interactive job you’re trying to submit? I.e., which queue, how many CPUs, how much memory, and so on? Then, once you have those, ask whether your system can actually ever fulfill that request (i.e., does such a machine even exist?).
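
If it helps, Slurm will tell you why a job is pending (a sketch; <jobid> is a placeholder):

squeue -j <jobid> -O JobID,Partition,State,ReasonList   # why is it pending?
scontrol show job <jobid>                               # the full request, including Reason=
sinfo -o "%P %c %m %D %t"                               # what each partition can actually offer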

I realised something while SSHing into the PROD cluster (login node) from the OOD server.

When I run:

ssh username@grhpc2.na.prod.com

it asks for a password.

In the case of the DEV cluster, it is connected to OOD, so I am using localhost to submit the job, which does not require a password.

That’s definitely the issue (and a bit of a gap in our documentation). You need to set up host-based SSH authentication between your OnDemand node and the production servers. There are a few threads here on Discourse about that if you search for it.
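
Roughly, host-based authentication has this shape (a sketch only; details vary by distro, so check the OpenSSH man pages and those threads):

# On the OnDemand node (client), in /etc/ssh/ssh_config:
EnableSSHKeysign yes
Host grhpc2.na.prod.com
    HostbasedAuthentication yes

# On the prod login node (server), in /etc/ssh/sshd_config:
HostbasedAuthentication yes

# Also on the server: list the OOD node in /etc/ssh/shosts.equiv, add its host
# public key to /etc/ssh/ssh_known_hosts, then restart sshd.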

The standard procedure in our environment is to use password-based authentication for individual users. As an alternate approach, is it possible to connect the PROD cluster to OOD like we have done with DEV?

sacctmgr list clusters
   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS 
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- --------- 
reshpcusw+    10.12.6.22         6817  9984         1                                                                                           normal      

Yes, absolutely. This is likely what you should do, given that you’re being prompted for passwords.

Hi Jeff,

Could you help me with this? I am not sure how it was added initially; I am fairly new to this. How do I add another cluster?

There are two things at play here (possibly three: you have a development Slurm cluster, which I think complicates things, as we at OSC don’t have a development Slurm cluster, only production ones).

One is how you’re communicating with Slurm. To communicate with Slurm from the same machine, simply set that machine up as you might a login host: it has the Slurm binaries, munge, and all the configuration you need to communicate with Slurm. Testing this does not require OnDemand, so don’t move ahead until you can interact with that cluster through the command line on that same server. OnDemand issues commands just like you do on the command line, so if it works for you in a shell, it’ll work for OnDemand.
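
For example, a quick sanity check from that server’s shell, before involving OnDemand at all (assuming the Slurm client commands and munge are installed):

munge -n | unmunge        # munge credentials work locally
sinfo                     # can read the cluster's state
squeue -u $USER           # can query jobs
sbatch --wrap="hostname"  # can submit (scancel the test job afterwards)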

The second issue is multi-cluster support. Here you need a different clusters.d file for every cluster. So you’d have /etc/ood/config/clusters.d/first_cluster.yml and /etc/ood/config/clusters.d/second_cluster.yml.

Here’s an example I pulled from our system. So you’d have first_cluster.yml point to a .conf file that corresponds to that cluster.

OnDemand will see these two clusters as two separate and distinct clusters.

---
v2:
  job:
    adapter: slurm
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm-cardinal.conf"

On my OOD server I can run Slurm commands; it’s like a login node for DEV.

These are my two cluster YAML files:

DEV-

---
v2:
  metadata:
    title: "HPC Cluster"
    url: "https://localhost"
  login:
    host: "localhost"
    user: "%{user}"
    default: true
    auth: "ssh"
  job:
    adapter: "slurm"
    cluster: "reshpcusw2cdev"
    bin: "/usr/bin"
    strict_host_checking: false

PROD-

---
v2:
  metadata:
    title: "HPC Cluster PROD"
    url: "https://grhpc2.na.prod.com"
  login:
    host: "grhpc2.na.prod.com"
    user: "%{user}"
    default: true
    auth: "ssh"
  job:
    adapter: "slurm"
    cluster: "reshpcusw2cprd"
    bin: "/usr/bin"
    strict_host_checking: false
 

Is it possible to connect both the DEV and PROD login nodes to the same OOD server? And how can I do it?

When I run “sinfo” or “squeue”, it shows all the DEV nodes, and so I am able to submit jobs on localhost. I want the same thing for PROD.

Again, remove OnDemand from the question and ask: how can I get this machine to submit to two Slurm clusters? When it works from the CLI, it’ll work for OnDemand. In your clusters.d files, supply the .conf location for each cluster (as there should be one for each Slurm cluster on the machine), as I have in my comment above.

Get it to work from the CLI first. Then you can get it to work from OnDemand.
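
One way to run that CLI test is to point Slurm’s client commands at one conf file at a time via the SLURM_CONF environment variable (the paths below are assumptions mirroring the example above):

SLURM_CONF=/etc/slurm/slurm-dev.conf  sinfo   # should list the DEV nodes
SLURM_CONF=/etc/slurm/slurm-prod.conf sinfo   # should list the PROD nodes

# Once both work, give each clusters.d YAML file its matching conf: line and
# OnDemand will talk to the corresponding cluster.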