Environment
- OnDemand 2.0.32
- RHEL 8.8
- Slurm 23.02
We are attempting to run an Open OnDemand portal on each of the 3 login nodes of our HPC cluster; all of them share the same /home file system.
Everything works fine until a user switches to another login node; then all OnDemand interactive sessions stop working (we currently have Desktop, RStudio, and JupyterLab installed).
For example, if I log into login1, start an interactive job, do my work, and exit, there are no problems. I can even start a new job on login1, do my work, and exit without issue.
But if I then switch over to login2, start a new interactive job (like RStudio), the Slurm job starts on the cluster, but OnDemand doesn’t see the Slurm job as running:
Yet it is:
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  164     debug sys/dash bviviano  R 15:23     1 a0n11
If I connect directly to the compute node through the proxy (i.e. https://login2.domain.com/rnode/a0n11/PID), I get the Jupyter or RStudio login page. After manually entering the credentials stored in connection.yml, I can get into the interface and use it fine.
So, that’s my quandary. Job 164 is running fine and I can connect to it through the OnDemand proxy (if I enter the generated credentials manually), but I just don’t get the usual Connect button:
If I muck around and purge ~/ondemand, restart all the NGINX processes, etc., I eventually get back to a state where it works again. That lasts until I switch to another login node; then it stops working on all login nodes.
This feels like a caching issue, but I can’t find any error messages in the logs to point me in a direction, and I am not sure where else to look.
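One check I could run on each login node is to list the session records the dashboard keeps under ~/ondemand and compare the job IDs with what squeue reports. Below is a minimal sketch of that, not a verified tool: the directory path (data/sys/dashboard/batch_connect/db) and the field names job_id and cluster_id are my assumptions about how OnDemand 2.0 lays out its session data, and list_session_records is just a hypothetical helper name.

```python
import json
from pathlib import Path


def list_session_records(db_dir):
    """Return (file name, job_id, cluster_id) for every file in db_dir.

    Files that cannot be read or parsed as a JSON object are reported
    with None for both fields, so stale or corrupt records stand out.
    """
    records = []
    for path in sorted(Path(db_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            records.append((path.name, None, None))
            continue
        if not isinstance(data, dict):
            records.append((path.name, None, None))
            continue
        # "job_id" and "cluster_id" are assumed field names, not confirmed.
        records.append((path.name, data.get("job_id"), data.get("cluster_id")))
    return records


if __name__ == "__main__":
    # Assumed location of the dashboard's session records on shared /home.
    db = Path.home() / "ondemand" / "data" / "sys" / "dashboard" / "batch_connect" / "db"
    if db.is_dir():
        for name, job_id, cluster_id in list_session_records(db):
            print(name, job_id, cluster_id)
    else:
        print(f"No session directory found at {db}")
```

If the job ID that squeue shows (164 here) appears in a record on one login node but the cluster name in that record differs from what the other login nodes expect, that would at least narrow down where the mismatch is.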
Thanks.