Multiple SLURM clusters and OnDemand

Hi everyone,

We are a PBS shop and are starting to play around with SLURM on our clusters. We have everything working fine with one cluster, but now that we are adding another one I’m not sure what we need to do to get it to work. We have the two clusters set up independently, so there is a slurmdbd, slurmctld, etc. on each. We set up the clusters.d files to point to the appropriate slurm.conf files, but when we go to submit an interactive app it fails with an unrecognized cluster id.

Do we need to set up SLURM in a multi-cluster configuration in order to get this to work? Is there another way to set this up without that? We noticed a few other tickets talking about this, but they were not exactly our issue. This is most likely due to our inexperience with SLURM, so any help would be appreciated.

Please let me know what files/information would be helpful in answering this question and I’ll get it added.

Thanks,

Matt

@msgambati-INL This should be a quick fix! Can you share your cluster configuration file for your second cluster in clusters.d/?

Here you go:

# /etc/ood/config/clusters.d/[cluster_1].yml
---
v2:
  metadata:
    title: "[cluster_1]"
    url: "http://[hostname_1]/hardware/[cluster_1]"
    hidden: false
  login:
    host: "[hostname_2]"
  job:
    adapter: "slurm"
    host: "[hostname_3]"
    bin: "/opt/slurm/bin"
    conf: "/opt/slurm/etc/slurm_[cluster_1].conf"
  acls:
  - adapter: "group"
    groups:
      - "[user_1]"
      - "[user_2]"
      - "[user_3]"
      - "[user_4]"
    type: "whitelist"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        module use /apps/system/modulefiles
        module load ood_vnc
        %s

You may need v2.job.cluster in the clusters.d file. I think that, together with the separate config files you already have configured there, may be enough to make it work.

When you’re in a shell session on that machine, do you need to use the -M flag? That’s what v2.job.cluster will provide. I guess that’s my next question: what command arguments do you have to provide on the machine to submit jobs to either cluster?
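For reference, here’s a rough sketch of how that could look in the second cluster’s file, reusing the placeholders from the config above (the cluster name should match the ClusterName in that cluster’s slurm.conf):

# /etc/ood/config/clusters.d/[cluster_1].yml (job stanza only)
---
v2:
  job:
    adapter: "slurm"
    cluster: "[cluster_1]"   # passed to Slurm as -M/--clusters
    host: "[hostname_3]"
    bin: "/opt/slurm/bin"
    conf: "/opt/slurm/etc/slurm_[cluster_1].conf"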

When we add v2.job.cluster to the clusters.d file, we get the following error when submitting to the cluster via the interactive form:

sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sbatch: error: Sending PersistInit msg: Connection refused
sbatch: error: Sending PersistInit msg: Connection refused
sbatch: error: DBD_GET_CLUSTERS failure: Connection refused
sbatch: error: Problem talking to database
sbatch: error: There is a problem talking to the database: Connection refused.  Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.
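From what I can tell, this error means the Slurm client tools on the OnDemand host are trying to reach slurmdbd on localhost (the built-in default), since -M/--clusters has to ask slurmdbd which clusters exist. I’m guessing the slurm.conf that the OnDemand host uses for that cluster would need something like the following, with [dbd_host] as a placeholder for wherever that cluster’s slurmdbd actually runs:

# /opt/slurm/etc/slurm_[cluster_1].conf (excerpt; [dbd_host] is a placeholder)
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=[dbd_host]   # defaults to localhost if unset
AccountingStoragePort=6819         # slurmdbd's default port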

If I’m on the cluster, I don’t need to use the -M flag. For example, while on a login node for the system we’re trying to submit to, the following command successfully submits an interactive job: srun -A <code> -n 16 -G 1 --time=0-03:00 --pty bash -i. If I try that on the OnDemand server via the command line, the job gets sent to the first Slurm cluster.

Using the -M flag from the command line on the OnDemand server, I get an error: srun: error: Application launch failed: Communication connection failure.

Edit: The -M flag on the command line on the OnDemand server does work when I change the default slurm.conf to be the correct slurm.conf for that cluster. If the default slurm.conf is the other cluster’s configuration, it doesn’t work, which makes sense. I’m not sure if there’s a flag for srun or sbatch to specify a config file.
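Follow-up: as far as I can tell there isn’t an srun/sbatch flag for an alternate config file, but Slurm does honor the SLURM_CONF environment variable, and I believe that’s what the v2.job.conf key in the clusters.d file sets for OnDemand’s own submissions. So a rough hand-run equivalent on the OnDemand server, reusing the paths and placeholders from above, would be something like:

# point the Slurm client tools at the second cluster's config for this one
# command, then target that cluster with -M
SLURM_CONF=/opt/slurm/etc/slurm_[cluster_1].conf \
  srun -M [cluster_1] -A <code> -n 16 -G 1 --time=0-03:00 --pty bash -i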
