Our group currently has a single Open OnDemand (OOD) instance on a VM that works with our production Slurm cluster. I am working to move OOD to our Slurm login nodes, and that has been working just fine in testing.
My question is, would there be any known complications if I have multiple login nodes with OOD all configured to use the same URL? For example:
DNS name: mycluster.hpc.myuniversity.edu
Resolves to: << IP addresses of my login nodes >>
OOD /etc/ood/config/ood_portal.yml on all login nodes:
servername: mycluster.hpc.myuniversity.edu
Effectively, if users access https://mycluster.hpc.myuniversity.edu they would land on OOD running on any one of our login nodes and do their work against our production Slurm cluster. Would this cause any problems versus having an individual OOD servername per login node? So far in my testing I haven't run into any issues with this setup.
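For reference, the ood_portal.yml on each node would be essentially identical and boils down to something like this (the cert paths below are just placeholders, not our real ones):

```yaml
# /etc/ood/config/ood_portal.yml -- identical on every login node
servername: mycluster.hpc.myuniversity.edu

# Every node serving the shared name needs a cert valid for that name
# (placeholder paths)
ssl:
  - 'SSLCertificateFile "/etc/pki/tls/certs/mycluster.crt"'
  - 'SSLCertificateKeyFile "/etc/pki/tls/private/mycluster.key"'
```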
The short answer is: it’s possible, but it’s not a recommended practice.
If you put OOD on a login node, you're mixing two very different roles: that node now has to spawn PUNs (per-user NGINX processes), handle authentication, and serve web requests on top of all its normal login-node duties. I've seen sites do this, but it often leads to issues with disk usage, load, or general degradation over time.
It's hard to predict exactly what problems you'll run into, but some common ones include degraded performance on the login nodes, stale PUNs left running on whichever nodes round-robin DNS happened to land users on, sticky-session issues (a user's next request may hit a different node than the one holding their session), and the extra complexity of keeping certs and secrets consistent across nodes. And since this isn't a best-practice setup, it can also be harder to get help or find others running it the same way.
We would recommend a dedicated VM (or two behind a load balancer) for OOD instead. Spreading OOD across N login nodes usually just adds operational complexity and may cause much larger headaches down the road. Ansible or Puppet can help manage that complexity, and I'd highly recommend them regardless of which path you choose.
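For instance, a minimal Ansible sketch (the role layout, template name, and handler names here are just illustrative) might look like:

```yaml
# tasks/main.yml -- push one shared template to every OOD host
- name: Deploy ood_portal.yml
  ansible.builtin.template:
    src: ood_portal.yml.j2        # hypothetical template in the role
    dest: /etc/ood/config/ood_portal.yml
    owner: root
    group: root
    mode: "0644"
  notify:
    - Regenerate OOD Apache config
    - Restart httpd

# handlers/main.yml -- handlers run in the order defined here
- name: Regenerate OOD Apache config
  ansible.builtin.command: /opt/ood/ood-portal-generator/sbin/update_ood_portal

- name: Restart httpd
  ansible.builtin.service:
    name: httpd
    state: restarted
```

The same role then works whether you point it at one dedicated VM or a pool of hosts.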
We've been using a dedicated VM for OOD for some time, and are looking to move that service to login nodes. We have purchased a set of new bare-metal login nodes, which should be more than capable of handling the current workloads plus OOD.
Our Slurm login nodes are really just for folks to log in, maybe compile some things, and kick off Slurm jobs. Users work from our GPFS storage with quotas in place, and we run the Arbiter software, which polices resource usage and penalizes folks for using too much CPU/RAM.
The sticky-session issue you mentioned is a valid concern though, so I'll rework my Ansible code to configure each login node to use its own hostname as its OOD URL, and DNS round robin will land folks on a login node for OOD work.
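Roughly, the relevant bit of my template (ood_portal.yml.j2 is just my own file name; ansible_fqdn is the standard Ansible fact) would become:

```yaml
# templates/ood_portal.yml.j2 -- each login node renders its own name
servername: "{{ ansible_fqdn }}"
```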
As for Ansible, I'm fully on board with that. I've been working on a stack of Ansible roles for well over a year now to move our Slurm cluster, including OOD, FastX, etc., from CentOS 7 to RHEL 9. Our current stack was built and is maintained manually, and I'm working to fix that on a very large scale.