Multiple clusters from a single app

I’ll blame Matthew and Brandon from Idaho National Lab for this but am asking the OOD developers for help :wink:

The issue is, INL has successfully modified their Desktop and Jupyter apps to include multiple clusters in a single app, by adding an extra parameter to the app's form.yml (or the Desktop's cluster.yml), like:
```yaml
pbs_cluster:
  widget: select
  label: "Cluster"
  options:
    - ["Cluster1", "cluster1"]
    - ["Cluster2", "cluster2"]
```

and then feeding this pbs_cluster variable to the submit.yml.erb:
```yaml
- "-q"
- "<%= pbs_queue %>@<%= pbs_cluster %>"
```

We use SLURM, so this PBS solution does not work for us, but we do use a single slurmdbd for all our clusters, so we can cross-submit jobs with the -M flag (sbatch -M cluster1 …).
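For context, the cross-submission looks roughly like this from any login node (cluster names are just placeholders):

```bash
# Submit a job to a specific cluster from any login node sharing the slurmdbd
sbatch -M cluster1 job.sh

# Query one or more clusters explicitly; -M/--clusters also accepts "all"
squeue -M cluster1,cluster2 -u $USER
```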

So, I added a generic cluster to our setup that uses SLURM binaries that work across all our clusters, and I use -M in the submit.yml.erb to direct the job to a specific cluster:
```yaml
script:
  native:
    - "-M"
    - "<%= slurm_cluster %>"
```

In the process, I discovered that OOD hard-codes the -M flag at line 279 of gems/ood_core-0.11.3/lib/ood_core/job/adapters/slurm.rb:
```ruby
args += ["-M", cluster] if cluster
```

I tried commenting out that line (since I feed the flag in through the submit.yml.erb), and that does submit the job, with the app (desktop) starting on the compute node correctly, but OOD's Interactive Sessions page does not know about this job. I suspect this is because OOD queries SLURM behind the scenes for the job status, and since I removed the -M from the SLURM adapter, commands like squeue don't carry the appropriate cluster name.
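Roughly what I suspect is happening (job id and cluster name are just for illustration):

```bash
# The job lives on cluster1, but without -M squeue asks the local cluster:
squeue -j 12345               # local/default cluster: job not found
squeue -M cluster1 -j 12345   # cluster1 explicitly: the job shows up
```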

Perhaps there could be a simple fix to this for us (rather than waiting for the future OOD release that should allow this), which is why I am asking for feedback.

I guess the simplest way would be to set the cluster variable in slurm.rb to the slurm_cluster variable that I define in the submit.yml.erb, but I don't know the complexities of how all these things interact behind the scenes.

I appreciate any thoughts on this.

MC

I'm actually surprised this works for PBS. Maybe it works for PBS only because one binary can submit to and query multiple clusters? So the cluster_id in their database file (~/ondemand/data/sys/dashboard/batch_connect/db/) is incorrect, but qstat continues to search for that job id on other clusters, finds it, and returns the data to OOD? (I'm only guessing as to why it works for them.)
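For reference, this is the kind of cross-server addressing a single PBS binary allows, which is why I suspect it works for them (server names are hypothetical):

```bash
# PBS can target a queue or a job on another server with one binary
qsub -q workq@cluster2 job.sh   # submit to cluster2's server
qstat @cluster2                 # list jobs known to cluster2's server
qstat 12345.cluster2            # query a job by its full, server-qualified id
```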

So, if you've configured your application with cluster: vulcan but you were actually able to submit to the cluster romulus, you would need to somehow enable the binary configured in /etc/ood/config/clusters.d/vulcan.yml to query both the vulcan and romulus clusters, because OOD will use that cluster config to run squeue.

But when would it query both clusters? It seems like you could create a wrapper script that can interact with both clusters and pass it an environment variable that tells it which cluster to interact with (or which binary to use). In that wrapper script you could catch the -M option, modify it as you see fit, and use the appropriate binary.
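A minimal sketch of that wrapper idea (untested; the OOD_TARGET_CLUSTER variable and paths are made up for illustration):

```bash
#!/bin/bash
# Hypothetical squeue wrapper: point the cluster config in
# /etc/ood/config/clusters.d/ at this script instead of the real squeue.
# OOD_TARGET_CLUSTER is a made-up variable naming the cluster to query.

REAL_SQUEUE=/usr/bin/squeue
args=()
while [ $# -gt 0 ]; do
  case "$1" in
    -M|--clusters)      # drop the -M that OOD passes in, plus its value
      shift
      [ $# -gt 0 ] && shift
      ;;
    --clusters=*)       # drop the single-argument long form
      shift
      ;;
    *)
      args+=("$1")
      shift
      ;;
  esac
done

# Re-inject the cluster we actually want to talk to
exec "$REAL_SQUEUE" -M "${OOD_TARGET_CLUSTER:-cluster1}" "${args[@]}"
```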

These are my initial thoughts on it. I'm not sure how easy this endeavor would be, but I do know we're adding this functionality to the next release, so it isn't too far away.

We're supporting multiple clusters with our apps, but our use case is different. Our main cluster is very busy, with many queues and a complex usage policy. A scheduler run typically takes 30-60 seconds and occasionally much longer (say, when someone submits 10,000 jobs that fail immediately). We have provisioned dedicated resources to support a certain class of interactive ood jobs, but if they went through the main scheduler they would experience unpleasantly long startup times.

We created a second cluster to support this class of jobs, but we don't want the users to have to think about it. They should just be able to request whatever resources they want, and their job should be sent to the appropriate place automatically. So the app forms just reference the main scc cluster; the redirection to the ood cluster happens in the wrapper scripts. We use SGE. The SGE qsub command has an option to query whether a job request can be started immediately on a cluster. So the qsub wrapper asks the ood cluster if it can run the job and submits it there if the answer is yes; otherwise it submits to the scc cluster. In order for OnDemand to track the jobs in the ood cluster, the qstat wrapper checks to see if a job with the right job_id and user exists in the ood cluster. If so, the qstat request goes there; otherwise it goes to the scc cluster. The qdel wrapper is similar.
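A simplified sketch of the qsub wrapper idea (not our production script; the paths are illustrative and the exact verification message may vary by SGE version):

```bash
#!/bin/bash
# Try the ood cluster first, fall back to scc. Each cluster is selected
# by sourcing its SGE settings file, which sets SGE_ROOT/SGE_CELL and
# puts that cluster's real qsub on PATH.

OOD_SETTINGS=/opt/sge/ood/default/common/settings.sh
SCC_SETTINGS=/opt/sge/scc/default/common/settings.sh

# "qsub -w p" validates the request against the cluster's current state
# without actually submitting the job.
can_run_on_ood() {
  ( . "$OOD_SETTINGS"
    qsub -w p "$@" 2>&1 | grep -q "verification: found suitable queue" )
}

if can_run_on_ood "$@"; then
  . "$OOD_SETTINGS"
else
  . "$SCC_SETTINGS"
fi
exec qsub "$@"   # the sourced environment resolves the real qsub
```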

This has been working without issue since last September, but in a few months the job ids in the scc cluster will roll over and then catch up with the job ids in the ood cluster. So it will be possible for the same user to have jobs with the same job id in both clusters, and OnDemand won't be able to track the one in the scc cluster. I think the probability of this happening is low, but we'll see. I hope the future multi-cluster support will allow me to eliminate this possibility. I would just need to be able to modify the cluster specified in the form after the form is submitted but before the actual job submission occurs.

Thanks Jeff and Mike,

I think we're stuck with SLURM because of the hard-coded "-M cluster" option. Since the multi-cluster support is not too far out, I'll wait till it's officially supported, and I'm looking forward to that.

MC

Hi,

I just wanted to report that I have upgraded to 1.8 and new multi-cluster support works perfectly for my use case. Thanks!

Thanks! But please let us know what use cases you need to accommodate.

Here's one of ours as an example: different modules are available on different clusters, so we have to hide/show the select options for version (the module versions) depending on which cluster is chosen.

So we've written some JavaScript to handle this, but we'd like to share it as helpers so other folks can easily do the same. We just have to get a sense of what all we need to cover (or at least what we could provide that would cover a lot).

Jeff, I explained my use case above on Jun 8. The ability to choose the cluster based on the form input is all I need. The kludge I had previously implemented is now done with a couple of lines of ruby. Thanks.

Hi Jeff, I just got OOD 1.8 installed on our test instance and am going through the apps to modify them to support multiple clusters.

It would be nice for the form.yml to be dynamic in such a way that it would, for example, display different attribute help for different clusters. Is that possible, e.g., with the JavaScript that you mention?

If it is, it would be nice to have an example; I am still wrapping my head around what pieces are used around the form.yml to render the job parameters webpage. Perhaps documenting the workflow of what happens when the webpage gets generated would help in understanding the process and the pieces that contribute to it.

Thanks,
MC

Hey, sorry for the delay. You can see our Jupyter deployment for example JavaScript showing how we toggle the CUDA option or which nodes are available.

You can watch this ticket and/or comment on it for this feature. We would like to add this to the core distribution so it becomes easier for admins to enable this type of interactivity.

@mcuma please reach out to Jeff and me directly via email about this. We could help you with the JavaScript that is needed in the short term, but after the OOD 2.0 stable release we would like to prioritize extending the form.yml DSL to better support these types of cases, so understanding your specific use case may help us build the right extension.

I did not make any progress on this; the multi-cluster setting works OK, and users are more or less used to not using the advanced settings on clusters where they don't apply.

That said, let me look at 2.0 when it gets released next week and then write back on what dynamism would be nice to have from our standpoint.