I’m a Linux administrator for my University. A professor purchased two Lambda servers for AI research and asked if I could install Open OnDemand. I said yes thinking it was just another “yum/apt install” request.
After some review, it seems clear I want to set up a separate server for Open OnDemand (correct me if I’m wrong, please). I’ve created a new server in VMware running Debian 12 and I’m installing Open OnDemand this week. I’ll configure authentication through our CAS system.
My goal right now is to get a minimal setup working that provides command line access to the Lambda servers as a replacement for SSH, and build from there. My understanding is that the two Lambda servers act as the compute nodes.
My problem: I have zero experience with HPC. I’ve been looking at the diagram in the link below. I don’t understand whether I need to have a cluster set up first or whether setting up Open OnDemand builds the cluster. Will I need shared backend storage, or can I get away without it for such a simple setup?
Hi Peter:
In general, an HPC cluster consists of nodes that are deployed from a common image via a tool like Warewulf, Bright Cluster Manager, lsof, … and jobs are scheduled across them by a scheduler such as Slurm, PBS Pro, Platform LSF, … Job submission is done via CLI commands like “sbatch myscript.sh” (for Slurm).
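For anyone new to schedulers: a batch job is just a shell script with scheduler directives written as comments. Below is a minimal sketch of what a myscript.sh might contain, assuming Slurm. The job name, time limit, and output file name are placeholders for illustration, not anything specific to this setup.

```shell
#!/bin/bash
# Hypothetical contents of myscript.sh.
# The #SBATCH lines are read by Slurm at submission time; to bash they
# are plain comments. Below: a placeholder job name, one node, one task,
# a five-minute wall-clock limit, and an output file named after the
# job ID (%j).
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=hello-%j.out

# The actual work: report which node the scheduler picked.
node="$(uname -n)"
echo "Running on ${node}"
```

Submitted with “sbatch myscript.sh”, the echo output would end up in hello-<jobid>.out in the directory the job was submitted from, which is one reason the storage question matters later.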
Open OnDemand leverages the scheduler by providing a web-based GUI to create and submit jobs, as well as a web-based SSH client to log in to (typically) the job submission node for the cluster.
In the absence of a scheduler, you could run OOD on one of the customer nodes, as long as accessing that node via HTTPS is allowed. The SSH client would then spawn on whatever node you named, possibly including the one being used to host OOD. I would prefer a VM for OOD for isolation reasons and to avoid “why is OOD down/slow/…?” caused by heavy compute load on your customer’s systems.
I’ve not tried a “no cluster” setup, but I think an OOD install with /etc/ood/config/apps/shell/env containing
DEFAULT_SSHHOST=name of a customer node
would allow the user to get a shell prompt on one of his nodes to do whatever. If the user really wants to have a job scheduler, then you’ll likely need another VM to run the Slurm/PBS/… scheduling daemons, as well as installing a Slurm/PBS/… client on each compute node. Without knowing what your user expects, I wouldn’t open that bucket of work (for you) unless the customer says something.
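Spelled out, the env file might look like the sketch below. The host name lambda1.example.edu is hypothetical; substitute one of the real Lambda nodes. As far as I know, recent OOD releases document this variable as OOD_DEFAULT_SSHHOST and also check an OOD_SSHHOST_ALLOWLIST for permitted hosts, so treat the exact variable name as an assumption and verify it against the docs for your installed version.

```shell
# /etc/ood/config/apps/shell/env
# Hypothetical host name: replace with one of the real Lambda nodes.
# Newer OOD releases may expect OOD_DEFAULT_SSHHOST instead (assumption:
# check the shell app documentation for your version).
DEFAULT_SSHHOST=lambda1.example.edu
```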
In all cases where schedulers are involved, the OOD server will need the same /home as the compute nodes, in order to have access to job output as well as to provide scripts for input.
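A quick sanity check for the shared /home, as a sketch: the helper below (home_source is a name made up for this example) prints the device or export and the filesystem type backing a directory, using GNU df. Run on the OOD VM and on each compute node, the output should be identical everywhere if /home really is the same filesystem. It is a rough check, not proof.

```shell
#!/bin/bash
# Print the source (device or server:/export) and filesystem type that
# back a directory. Relies on GNU coreutils df and its --output option.
home_source() {
    # df prints a header line first; keep only the data line.
    df --output=source,fstype "$1" | tail -n 1
}

# Only query /home if it exists on this machine.
if [ -d /home ]; then
    home_source /home
fi
```

On a machine with a local /home this prints something like a disk device and ext4; with an NFS-mounted /home it would instead show a server:/path source and an nfs type.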
Thanks for the reply. I had to look up Warewulf and Bright Cluster Manager, but was confused by “lsof” in this list. I know it as “list open files”?
I’m looking at Slurm for the job scheduler. I’ve installed it for a professor in the past on a single server, but I’m not familiar with its operation (yet).
I’ve installed OOD on a VM running Debian 12. I’m working on getting authentication working with our CAS server.
I’m sure I’ll have more questions but I’ll post them as they come up.
Hi Pete:
My fat fingers: replace lsof with losf.
losf == lots of small files
lsof – list open files, and isn’t relevant.
losf is used in conjunction with cobbler (https://cobbler.github.io) (or was when I took the HPC Admin training offered by https://tacc.utexas.edu back in 2018).
Thanks, I was thinking there had to be shared storage. I’m going to need to review options. I suspect an NFS server would be the simplest solution. We have a Storage Area Network with NetApp storage, and I suspect a shared file system over the SAN would be another option.