New to HPC and Open OnDemand

I’m a Linux administrator for my University. A professor purchased two Lambda servers for AI research and asked if I could install Open OnDemand. I said yes, thinking it was just another “yum/apt install” request.

After some review it seems clear I want to set up a separate server for Open OnDemand (correct me if I’m wrong, please). I’ve created a new server in VMware running Debian 12, and I’m installing Open OnDemand this week. I’ll configure authentication through our CAS system.

My goal right now is to get a minimal setup working that provides command-line access to the Lambda servers as a replacement for SSH, and build from there. My understanding is that the two Lambda servers act as the compute nodes.

My problem: I have zero experience with HPC. I’ve been looking at the diagram in the link below. I don’t understand whether I need to have a cluster set up first, or whether setting up Open OnDemand builds the cluster. Will I need shared backend storage, or can I get away without it for such a simple setup?

In advance, thanks for any advice or direction.


Hi Peter:
In general, HPC consists of nodes that are deployed from a common image via some tool like Warewulf, Bright Cluster Manager, lsof…, and jobs are scheduled across them by a scheduler such as Slurm, PBS Pro, Platform LSF, … Job submission is done via CLI commands like “sbatch myscript.sh” (for Slurm).
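
For reference, a Slurm batch script is just a shell script with #SBATCH directives at the top; a minimal sketch might look like this (all the values below are placeholders):

```bash
#!/bin/bash
# Minimal Slurm batch script (sketch; values below are placeholders).
#SBATCH --job-name=hello        # name shown in squeue
#SBATCH --nodes=1               # run on a single node
#SBATCH --ntasks=1              # one task/process
#SBATCH --time=00:05:00         # wall-clock limit of 5 minutes
#SBATCH --output=hello-%j.out   # %j expands to the job ID

hostname
```

You submit it with “sbatch myscript.sh” and watch it in the queue with “squeue”.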

Open OnDemand leverages the scheduler by providing a web-based GUI to create and submit jobs, as well as a web-based SSH client to log in to (typically) the job submission node for the cluster.

In the absence of a scheduler, you could run OOD on one of the customer nodes, as long as accessing that node via HTTPS is allowed. The SSH client would then spawn on whatever node you named, possibly including the one being used to host OOD. I would prefer a VM for OOD, for isolation reasons and to avoid “why is OOD down/slow/…?” questions caused by heavy compute load on your customer’s systems.

I’ve not tried a “no cluster” setup, but I think an OOD install with /etc/ood/config/apps/shell/env containing
DEFAULT_SSHHOST=name of a customer node
would allow the user to get a shell prompt on one of his nodes to do whatever (see the sketch below). If the user really wants a job scheduler, then you’ll likely need another VM to run the Slurm/PBS/… scheduling daemons, as well as a Slurm/PBS/… client installed on each compute node. Without knowing what your user expects, I wouldn’t open that bucket of work (for you) unless the customer says something :blush:
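
For example, the env file might look like this (the hostname is just a placeholder for one of your Lambda nodes):

```bash
# /etc/ood/config/apps/shell/env (sketch; hostname is a placeholder)
# Default host the OOD shell app will ssh to when none is given in the URL.
DEFAULT_SSHHOST=lambda01.example.edu
```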

In all cases where schedulers are involved, the OOD server will need the same /home as the compute nodes, in order to have access to output as well as to provide scripts for input.

I hope some of this rambling helps,
Ric

Hi Ric,

Thanks for the reply. I had to look up Warewulf and Bright Cluster Manager, but I was confused by “lsof” in that list; I know it as “list open files”?

I’m looking at Slurm for the job scheduler. I’ve installed it for a professor in the past on a single server, but I’m not familiar with its operation (yet).

I’ve installed OOD on a VM running Debian 12 and am working on getting authentication working with our CAS server.
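
The rough plan for the CAS piece is something along these lines in /etc/ood/config/ood_portal.yml, assuming the Apache mod_auth_cas module (not tested yet, and all hostnames below are placeholders):

```yaml
# /etc/ood/config/ood_portal.yml (sketch, untested; assumes mod_auth_cas
# is installed; hostnames are placeholders)
servername: ondemand.example.edu
auth:
  - "AuthType CAS"
  - "Require valid-user"
  - "CASLoginURL https://cas.example.edu/cas/login"
  - "CASValidateURL https://cas.example.edu/cas/serviceValidate"
```

After editing the file, the Apache config gets regenerated and reloaded (e.g., with the update_ood_portal helper).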

I’m sure I’ll have more questions but I’ll post them as they come up.

Have a good weekend and again thank you.

Pete

Hi Pete:
My fat fingers: replace lsof with losf.
losf == lots of small files
lsof == list open files, and isn’t relevant here.
losf is used in conjunction with Cobbler (https://cobbler.github.io), or was when I took the HPC admin training offered by https://tacc.utexas.edu back in 2018.

Sorry for the confusion,
Ric

Hi Ric,

Figured it was something like that, thanks for clarifying :slight_smile:

Pete

Thanks Ric for answering!

Yes, you will need shared storage between the VM and your compute nodes. Feel free to ask more questions here as they arise.

Slurm is a great choice for a job scheduler.


Jeff,

Thanks, I was thinking there had to be shared storage. I’m going to need to review options. I suspect an NFS server would be the simplest solution. We have a Storage Area Network with NetApp storage, and I suspect a shared file system over the SAN would be another option.
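
If I go the NFS route, I’m picturing something like this (a sketch only; hostnames and paths are placeholders):

```bash
# On the NFS server: export /home to the OOD VM and both Lambda nodes
# /etc/exports (sketch; hostnames are placeholders)
/home  ood.example.edu(rw,sync,no_subtree_check)  lambda01.example.edu(rw,sync,no_subtree_check)  lambda02.example.edu(rw,sync,no_subtree_check)

# On the OOD VM and each Lambda node: mount that same /home
# /etc/fstab entry (sketch)
nfs-server.example.edu:/home  /home  nfs  defaults,_netdev  0 0
```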

Pete

Hello, I am piggybacking on this question. We have a BCM (Bright Cluster Manager) 9.2 cluster at my university, and the ITS department is stalling on the request we put in to install OOD. They keep thinking we need to purchase additional hardware and reinvent the wheel to install OOD on RHEL 8. Can you confirm my research: can we install OOD on a virtual machine and point it at one of the login nodes, or install it physically on one of the 25 compute nodes? Do you have a small diagram of the installation you recommend for a Bright cluster with 2 login nodes and 25 compute nodes, so I can send it to them?
It would be a lifesaver.

Joe

We have OOD installed in a 16 GB / 4-core (AMD, 2.8 GHz) VM running under a Proxmox hypervisor. At this moment there are 111 concurrent users (with typically 4 processes each) and it’s not a problem. OOD has to mount the same home directory the compute nodes see, and be able to submit batch jobs to the scheduling node (e.g., able to reach port 6817 on the Slurm server). If you want to allow shell sessions on the login nodes, then the OOD server will also need SSH access to the login nodes.
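
On the OOD side, the piece that ties it to the cluster is a file under /etc/ood/config/clusters.d/; a minimal sketch (hostnames and paths are placeholders for your site) would be something like:

```yaml
# /etc/ood/config/clusters.d/mycluster.yml (sketch; hostnames/paths are placeholders)
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "login01.example.edu"     # node the shell/ssh links target
  job:
    adapter: "slurm"
    bin: "/usr/bin"                 # directory containing sbatch, squeue, etc.
    conf: "/etc/slurm/slurm.conf"   # path to slurm.conf (optional)
```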

Cheers,
Ric

I would add that we recommend setting OOD up as its own stand-alone web server (which we call a web node), meaning it has the software stack and runs on a VM or physical machine that integrates with your login solution, and you then use that to submit to and log in to nodes on the compute cluster. Another way to say this: we would advise against installing OOD on a compute cluster.

Over the years we’ve had many people go down that path just to get things going, then find they are trapped in the setup and hit many headaches with scaling or managing the deployment.

So the preferred method is a stand-alone web node for OOD, preferably on a VM, which then points to the login nodes and compute cluster. We do have a very high-level diagram of this here: Architecture — Open OnDemand 4.0.0 documentation