TL;DR: My department needs to set up an OnDemand instance with zero reliance on a login host.
I have a couple of working OnDemand instances for two different clusters:
1. The OnDemand instance is a VM, with the cluster config specifying an already-existing bare-metal login host, and the same hostname in the optional `submit_host` key. The VM does not (and could not possibly) have access to the cluster nodes, since they have unroutable IP addresses.
2. The OnDemand instance is a VM, with the cluster config specifying only a login host. Jobs are submitted via the Slurm controller and the Passenger apps go through the login host. The OnDemand instance could communicate directly with the cluster nodes, but I believe that is not necessary, and I’m pretty sure the current firewall configs don’t allow it. (Rough sketches of both configs are below.)
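Roughly, the two cluster configs look like this (file names and hostnames are placeholders):

```yaml
# /etc/ood/config/clusters.d/cluster_a.yml -- instance 1
# sbatch and friends run over SSH on the submit_host (the bare-metal login host)
v2:
  metadata:
    title: "Cluster A"
  login:
    host: "login.cluster-a.example.edu"
  job:
    adapter: "slurm"
    submit_host: "login.cluster-a.example.edu"
```

```yaml
# /etc/ood/config/clusters.d/cluster_b.yml -- instance 2
# no submit_host, so Slurm commands run locally on the OnDemand VM and talk
# to the Slurm controller directly (hence the local Slurm binaries + MUNGE)
v2:
  metadata:
    title: "Cluster B"
  login:
    host: "login.cluster-b.example.edu"
  job:
    adapter: "slurm"
```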
I need to present all alternative architectures and configurations and make sure that I am not missing anything. The reason: different departments are permitted to use different login hosts, whose hostnames have been established over many years, multiple generations of computational clusters, and various agreements.
Option 3 would likely be to install OnDemand on top of all three bare-metal login hosts.
Option 4 may be to set up a new login host in the cluster that all departments are able to access, and run that as the OnDemand portal.
I have security concerns about the third and fourth options. I am sure they would function, but I see them as an excessive risk, and both options still rely on a login host, which is exactly the reliance being challenged.
Are there any other options? I’m getting some pushback on OnDemand’s reliance on a login server, which seems like a core requirement for interactive apps. I think there is a perception that, because the Slurm binaries and MUNGE are required on the OnDemand host, all functionality should be possible via direct interaction between OnDemand and the Slurm controller.
I’m a bit confused by your use of the term ‘login host’. What does that really mean on your end?
We like to always say that your OnDemand instance should be treated just like a login node. The main reason is that, fundamentally, that’s what it is!
Put another way, here is the list of things a login node conceptually needs to do:
Provide a way for users to authenticate and ‘login’
Provide access to some sort of storage / file system that is also connected to the compute resources
Provide access to some sort of resource scheduler / manager that is also connected to the compute resources
So in order to have Open OnDemand work, the instance fundamentally needs to be able to:
Connect to some sort of identity provider to authenticate
Connect to some sort of storage / file system to read/write files
Connect to some sort of resource scheduler / manager to request / utilize compute resources
(Optional) Have trust relationships set up to be able to connect directly to compute resources, if you want interactive app / desktop functionality
Since your pushback seems to be regarding the interactive capabilities, yes, some sort of trust / routing needs to be available if you want to enable that. Fundamentally that makes sense if you think about it.
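To make that concrete: for interactive apps, the OnDemand portal acts as a reverse proxy to the compute nodes, which you enable in ood_portal.yml by telling Apache which hosts it is allowed to proxy to. A sketch (the regex is a placeholder and should be narrowed to match only your compute node hostnames):

```yaml
# /etc/ood/config/ood_portal.yml (excerpt)
host_regex: '[\w.-]+\.cluster\.example\.edu'  # placeholder: match your compute nodes
node_uri: '/node'
rnode_uri: '/rnode'
```

After changing it, regenerate the Apache config with `sudo /opt/ood/ood-portal-generator/sbin/update_ood_portal` and restart Apache. Of course, this only works if the portal host can actually route to those nodes.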
Thank you for looking into this. “Login host” is the term I’m using to represent a couple of things, primarily the “host” included in the minimal cluster config example.
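That is, the `host` in something along these lines (hostname is a placeholder):

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "login.my_center.example.edu"
```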
The common cluster configuration I have seen always has at least one, and usually a few, SSH login hosts that researchers can connect to, plus a lot of cluster nodes (the ones running slurmd) that are intentionally unreachable via SSH directly (e.g., unroutable IPs or firewall rules).
Are there any interactive capabilities (e.g. Jupyter or RStudio) that can be successfully offered to researchers where the OnDemand instance sits outside the cluster and relies entirely on connectivity to the Slurm controller, with no direct connectivity to the cluster nodes?
I’d say no. One other ‘maxim’ we often mention when people ask if OnDemand can do ‘X’ is: if a knowledgeable regular client can already do ‘X’ in your environment with some combination of existing tools (e.g. SSH, X-Windows, etc.), then yes, OnDemand can do ‘X’.
So the question I’d pose back is: do you know of any way that an existing client can rely on just connectivity to your Slurm controller to accomplish ‘interactive capabilities’? I’m not aware of any way to do that by ‘proxying’ through Slurm.
Fundamentally, to have any type of interactivity, clients need to be able to connect directly (or via some sort of proxy / relay) to the compute resources providing the underlying capabilities. The ‘traditional way’ is what you describe: you SSH to a login node, then from that login node you SSH to a cluster node.
Open OnDemand is set up to do the same thing fundamentally, but it needs to be configured the same way you configure your existing login nodes (particularly since you are, in effect, replacing the login nodes for many of your clients).
I’m a bit confused about what the concerns are with doing that.
Maybe phrasing the question differently will help too. Just to be crystal clear let me make sure I understand:
For the OnDemand VM to function independently of a separate login host, it still needs two basic, essential kinds of network connectivity:
Inbound connectivity from the researchers.
SSH connectivity to the cluster nodes.
I’ve heard it described as a “hard shell with a gooey center”.
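In other words, both of these checks should succeed (hostnames are placeholders):

```bash
# From a researcher's workstation: inbound HTTPS to the OnDemand portal
curl -I https://ondemand.example.edu

# From the OnDemand VM: outbound SSH reachability to a compute node
nc -vz compute-001.cluster.example.edu 22
```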
In our case, our architecture physically allows three machines to have that connectivity, and I am referring to them as “login hosts”.
Let me make sure I am correct: OnDemand may be able to function on its own, but only if it can reach the cluster compute nodes (as if it were its own login host).
No. OOD is just a Rails app, so it can’t compute anything.
You can set up OOD on some VM, give it no access to (or knowledge of) a compute cluster, and it will work. It just won’t submit jobs to anything or let you interact with a compute cluster. Is this what you are asking? I’m a bit confused myself reading this.
Nothing in OnDemand actually requires a login node that I’m aware of; it’s just a feature you can choose to use.
The generally best way to deploy OnDemand is on a dedicated host such as a VM. That VM needs the scheduler binaries and must be able to communicate with the scheduler; for Slurm, the OnDemand VM needs to be able to submit jobs with sbatch.
The login host part is for the OnDemand Shell app, so that users can log in from the OnDemand web portal to a login node. Using that approach, you’d need to allow SSH from the OnDemand VM to the configured login node.
It sounds like you have many different login nodes, so in that case you could configure multiple “clusters” that define each login node and limit each “cluster” to the desired POSIX group by doing something like `chmod 0640 <path to cluster YAML>` and `chgrp <cluster group> <path to cluster YAML>`.
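For example, assuming a cluster config and a hypothetical dept_a group:

```bash
# Only members of dept_a can read this cluster config, so only they see the cluster
sudo chgrp dept_a /etc/ood/config/clusters.d/cluster_a.yml
sudo chmod 0640 /etc/ood/config/clusters.d/cluster_a.yml
```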
If you do not want the Shell login app, then do the below and the Shell app won’t be accessible to users:
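A sketch of one way to do that, assuming the stock app location (the per-user web server runs as the logged-in user, so revoking their read access makes the app inaccessible):

```bash
# Assumption: default system app path for a package install of OnDemand
sudo chmod 700 /var/www/ood/apps/sys/shell
```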
Thanks, Travis. My main takeaway is that if the compute nodes have unroutable IP addresses, then the OnDemand instance needs to be in that same building, with dual-homed networking just like the login hosts.