OOD portal with Slurm as a resource manager/two clusters

rgas20 · January 27, 2020, 7:59pm

All,

Sorry to be slow to understand. We have two clusters running Slurm for the resource manager and we are working on adding cluster configuration files now. I have imported the munge key from the first cluster onto the ondemand portal server (running on a separate VM) and have verified it and started up the munge daemon.

First of all, since we are running two separate clusters, it looks like tying in the second one will require two munge keys and two separate munge daemons, each started with its own --key-file and --socket options (based on the Slurm documentation).

Following that, there is something I’m unclear on and want to understand before I dive into anything else. Do I need to copy the exact same slurm.conf file from the cluster I am trying to tie in to /etc/slurm/slurm.conf on the OOD portal node? I have the Slurm binaries installed via rpmbuild on the OOD portal node. Other than that, I have completed the cluster config files and can ssh into the login node for each cluster; that’s as far as I have gotten.

jeff.ohrstrom · January 28, 2020, 3:04pm

Not to make it more confusing - but here are two separate approaches you could use. In both you’d still have to have two OOD cluster configurations.

In the first you specify bin_overrides in the config and use an ssh wrapper. This shells into the login node of the appropriate server and executes the commands. This way you don’t have to worry so much (actually none at all) about configurations on the web portal’s node. There’s a description of how to do that in this topic. And you can search ssh wrapper on this site because it’s come up before.

Note that users have to be able to ssh from the web node to the login node without being prompted for this to work.
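
To make the first approach a bit more concrete, here is a rough sketch of what such a wrapper could look like. It is only an illustration under assumptions: the login host name, the remote sbatch path and the function names are placeholders, not anything OnDemand ships. The cluster config’s bin_overrides entry for sbatch would then point at wherever this script is installed.

```python
#!/usr/bin/env python3
"""Hypothetical ssh wrapper for sbatch (a sketch, not an official OOD script)."""
import os
import shlex
import sys

# Assumptions: placeholder values, replace with your cluster's login node
# and the path to sbatch on that node.
LOGIN_HOST = "login.cluster1.example.edu"
REMOTE_SBATCH = "/usr/bin/sbatch"


def build_ssh_command(args, host=LOGIN_HOST, remote_bin=REMOTE_SBATCH):
    """Return the argv list that runs `remote_bin args...` on `host` via ssh."""
    # Quote every argument so spaces and shell metacharacters survive
    # the remote shell that ssh starts on the login node.
    remote = " ".join(shlex.quote(a) for a in [remote_bin, *args])
    # BatchMode=yes makes ssh fail immediately instead of prompting for a
    # password, which is why users need key-based (or host-based) login.
    return ["ssh", "-o", "BatchMode=yes", host, remote]


def main():
    cmd = build_ssh_command(sys.argv[1:])
    # Replace this process with ssh so stdin (the job script), stdout and
    # the exit status pass straight through to the caller.
    os.execvp(cmd[0], cmd)

# A deployed wrapper would end with a call to main().
```

You would need one such wrapper per cluster (or one wrapper that picks the host), and similar ones for squeue, scontrol and scancel if those are overridden too.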

The second approach is what you’re thinking of and describing: one binary, two daemons, and two configs (each using its respective daemon). I think you should be able to copy the slurm.conf from each cluster to the portal node and only have to modify AuthInfo in one or both of the configs. Since the daemons listen on different sockets, the configs will likely have to reflect that. Starting them manually with CLI arguments would be very fragile. Using systemd would be a lot more robust, but you’d have to put work into ensuring each systemd unit (one per cluster daemon) is isolated from the other and always starts with the right configuration. Looks like there’s a SLURM_CONF environment variable you can use to point the Slurm commands at the right config file.
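
As a sketch of the moving parts in this second approach: each cluster gets its own munged instance (started with its own --key-file and --socket), and Slurm’s client commands are pointed at the matching config through the SLURM_CONF environment variable. The paths, cluster names and helper names below are made up for illustration, and a real setup would probably also need separate pid, log and seed files per daemon.

```python
import os

# Assumption: hypothetical per-cluster slurm.conf copies on the portal node.
CLUSTER_CONFS = {
    "cluster1": "/etc/slurm/cluster1/slurm.conf",
    "cluster2": "/etc/slurm/cluster2/slurm.conf",
}


def munged_command(key_file, socket):
    """Build the argv for one munged instance with its own key and socket."""
    return ["munged", f"--key-file={key_file}", f"--socket={socket}"]


def slurm_env(cluster, base_env=None):
    """Return a copy of the environment with SLURM_CONF set for `cluster`."""
    env = dict(os.environ if base_env is None else base_env)
    # Slurm client commands (sbatch, squeue, ...) read SLURM_CONF to find
    # their configuration file instead of the compiled-in default.
    env["SLURM_CONF"] = CLUSTER_CONFS[cluster]
    return env
```

For example, passing `env=slurm_env("cluster2")` to `subprocess.run(["squeue"], ...)` would query the second cluster, provided that cluster’s slurm.conf has AuthInfo pointing at the second daemon’s socket.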

Thinking about this a little more, the first approach seems a lot easier. The second option is probably viable, but you’d have to do it with automation. I imagine doing it by hand is likely to be very hard and fragile and, in the end, cause a lot of pain.

Hope that helps!

rgas20 · January 29, 2020, 4:42pm

Jeff, very helpful, thank you; that helped clear things up for us. We have job submission working for each cluster with the ssh wrapper for now. It’s a great feature and we look forward to customizing it.

danielfr · September 24, 2020, 6:47pm

Hi @rgas20,

Would you mind elaborating a bit on how you achieved this?
I have 2 clusters (one for prod and one for dev) that I would like to offer access to through OnDemand.

Thanks a lot!

danielfr · September 24, 2020, 7:54pm

Alright, I figured it out by reading Jeff’s answer and taking a look at the link he provided.

Note that users have to be able to ssh from the web node to the login node without being prompted for this to work.

This is very important too.

Sorry for the noise and thank you very much.
