Multiple Open OnDemand portals for the same cluster?

Environment

  • OnDemand 2.0.32
  • RHEL 8.8
  • Slurm 23.02

We are attempting to run an Open OnDemand portal on each of the 3 login nodes of our HPC cluster; all three share the same /home file system.

Everything works fine until a user switches to another login node; then all OnDemand interactive sessions stop working (we currently have Desktop, RStudio, and JupyterLab installed).

For example, if I am logged into login1, I can start up and run an interactive job, do my work, and exit with no problems. I can even start up a new job on login1, do my work, and exit, again with no problem.

But if I then switch over to login2 and start a new interactive job (like RStudio), the Slurm job starts on the cluster, but OnDemand doesn’t see the Slurm job as running.

Yet it is:

$ squeue
  JOBID PARTITION     NAME     USER ST   TIME NODES NODELIST(REASON)
    164     debug sys/dash bviviano  R  15:23     1 a0n11

If I connect directly to the compute node through the proxy (i.e. https://login2.domain.com/rnode/a0n11/PID), I can get to the Jupyter or RStudio login page, and if I manually enter the credentials stored in connection.yml, I can get into the interface and use it fine.
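For reference, the connection.yml I mean is the one OnDemand writes into each session’s output directory under ~/ondemand/data/sys/dashboard/batch_connect/ (default install location). Ours look roughly like this (values here are invented):

host: a0n11
port: 41234
password: 52a0eff9d9e9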

So, that’s my quandary. Job 164 is running fine and I can connect to it through the OnDemand proxy (if I enter the generated credentials manually), but I just don’t get the usual Connect button.


If I muck around, purge ~/ondemand, restart all the NGINX processes, etc., I eventually get back to a state where it works again, until I switch to another login node; then it stops working on all login nodes.
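To be concrete, the reset I keep falling back to looks roughly like this (destructive, and the path assumes a default install):

# wipe the dashboard’s per-user state, including all batch_connect session data
rm -rf ~/ondemand
# then restart the per-user NGINX, e.g. via the dashboard’s Help > Restart Web Server menu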

This feels like a caching issue, but I can’t find any error messages in the logs to point me in the right direction, and I am not sure where else to look.

Thanks.

You have 3 login nodes, but do you have 3 HPC clusters? You can run one OnDemand instance and connect to many clusters (that’s how we run our deployment at OSC). You simply create several cluster.d files.
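For example, you would have one file per cluster under /etc/ood/config/clusters.d/, along these lines (names and hostnames here are made up):

# /etc/ood/config/clusters.d/clusterA.yml
---
v2:
  metadata:
    title: "Cluster A"
  login:
    host: "clusterA-login.example.com"
  job:
    adapter: "slurm"
    cluster: "clusterA"

with a sibling clusterB.yml pointing at the other cluster’s controller.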

The issue you seem to be facing is that you have 3 distinct clusters you’re submitting jobs to. Is that correct? If so, that’s your issue: you have 3 clusters and they’re all named the same thing. Here’s what’s happening.

  • login to login1 and start a job on clusterA with job id 164.
  • login to login2, and OnDemand tries to find job 164 on clusterA. You’ve named this OnDemand cluster clusterA, but it is physically a different cluster than the one login1 connects to. As such, job 164 doesn’t exist there, and we (OnDemand) assume that because it doesn’t exist, it must have completed. We do not assume it came from some other cluster; we just assume it’s complete because squeue or similar can’t find the job anymore.
  • the login2 OnDemand marks it as complete because it can’t find the job (roughly the check sketched just below).
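To make that concrete, this is roughly the per-session status poll the dashboard performs (simplified; the real Slurm adapter passes more flags to squeue):

# simplified sketch of the status check for a stored session with job id 164
if squeue --noheader --job=164 2>/dev/null | grep -q .; then
  echo "job found -> session shows as Running, Connect button appears"
else
  echo "job not found -> session is marked Completed"
fi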

No, we have 1 HPC cluster (one slurmctld instance) running RHEL8, but with 3 login/interactive/submit nodes; all 3 login nodes share the same /home file system (as do all the compute nodes). We maintain 3 login nodes for redundancy, so that if userA does something bad/stupid on login1 and knocks it offline, there are still 2 other login nodes available.

Under RHEL7, we had the same setup with 3 login nodes, but only ran OnDemand on 1 of the login nodes. That worked fine for a few years, but was sometimes problematic if, say, enough users were hitting the X11 Desktop at the same time, making all the Desktops less responsive because of network delay.

With the switch to RHEL8 and the increased popularity of OnDemand at our site, we want to run OnDemand on all 3 login nodes and let users pick which of the 3 they want to hit.

As I said, everything works fine, but for some reason when I switch from login1 → login2, OnDemand gets confused and doesn’t see the job start. Watching the NGINX logs, I can see it report the job as completed before it even tries to run the squeue command. So it’s almost like a race condition, but I can’t quite figure out where/how.
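For anyone following along, the logs I am watching are the per-user NGINX logs, which on a default install live here (the path may differ on other installs):

tail -f /var/log/ondemand-nginx/$USER/error.log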

The only fix I’ve found so far is to set

OOD_PORTAL=ondemand-login[1-3]

in the dashboard env on each of login[1-3], so that each portal has its own separate state directory under $HOME. Not ideal, but workable.
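Concretely, on each node we drop a per-node value into the dashboard env file (the path below is the default location; the value differs per node):

# /etc/ood/config/apps/dashboard/env on login2
OOD_PORTAL="ondemand-login2"

With that set, the dashboard keeps its state under ~/ondemand-login2/ instead of the shared default ~/ondemand/, so the three portals stop stepping on each other’s session data.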

If there is something else I am missing, please let me know.

Thanks.

Sorry for the delay, I’ve been at a summer camp and took a week of vacation.

Thanks for the details. There’s something obvious that we’re missing; on the surface, your setup should just work without having to modify the OOD_PORTAL environment variable.

We have a similar setup with our dev, test & production instances. They all point to the same cluster and they all have the same HOME.

Do you have the exact same cluster.d files on all 3 instances (and the same Slurm configuration files, for that matter)?
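A quick way to check, run from login1 (hostnames and paths assumed; empty output means the files match):

# -a archive, -i itemize differences, -n dry run, -c compare by checksum
for host in login2 login3; do
  echo "== $host =="
  rsync -ainc /etc/ood/config/clusters.d/ "$host":/etc/ood/config/clusters.d/
  rsync -ainc /etc/slurm/slurm.conf "$host":/etc/slurm/slurm.conf
done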

This does work fine in our environment, last I checked.
