We are working on rebuilding our cluster with new hardware and networking. The idea was to have both clusters running simultaneously and let users test on the new cluster before we move the rest of the nodes over. We also spun up a new instance running the latest OOD (required for RHEL 9.1); our old instance is OOD 1.8. Home directories are shared between the two.
What I’m now running into is this: when I launch an interactive app on one of them, delete it, log out, and then go to the other instance and launch an interactive app, the app instantly ‘completes’ even though the job is still running in the Slurm scheduler. This happens in both directions. It seems to resolve when I completely shut down OOD on the side that was working and then try again on the other side.
It seems to me like the only thing that could be causing this is the ~/ondemand directory shared between them, but lsof on either side doesn’t show any open file handles under ~/ondemand. I thought the per-user nginx process might be holding something open, like a session socket, but that doesn’t seem to be the case.
Any thoughts on this would be greatly appreciated.
Thanks for your post. I’ve looked through the code and the documentation, and I’m unable to find any functionality that allows you to change your home directory.
We think we found the cause. It occurs when the user still has a browser open to the other OOD instance (even an idle tab, or a different browser entirely). It seems to be some interaction between the sessions and the shared home directory.
Yes, your intuition is correct: the two instances conflict with each other. Let’s call them clusters A and B. When you schedule a job on A, the OnDemand instance on B sees the same session through the shared home directory, queries its own scheduler for that job id, and doesn’t find it, so it assumes the job is complete and marks it as such.
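For context, the dashboard keeps a small record of each interactive session in the home directory, and each instance polls the scheduler named in that record. The path and key names below are illustrative assumptions, not the exact schema:

```yaml
# Illustrative sketch of a batch_connect session record, assumed to live under
# ~/ondemand/data/sys/dashboard/batch_connect/db/ (path and keys are assumptions)
id: "f2c9a7d4"        # session identifier
cluster_id: "owens"   # each instance resolves this against ITS OWN clusters.d/owens.yml
job_id: "1234567"     # B asks its scheduler about this id, gets nothing back,
                      # and marks the session completed
```

Because both instances define a cluster named owens, each one believes the record describes a job on its own scheduler.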
You have a couple options, though none are without some work. I’d suggest the first as it’s the cleanest for your users.
Rename the cluster on B. Say the cluster is named owens, so you have /etc/ood/clusters.d/owens.yml on both systems. You could rename the one on B to /etc/ood/clusters.d/owens-upgrade.yml. This makes OnDemand see them as entirely different schedulers, so they stop conflicting with each other. It does, however, require you to update your apps to use both clusters.
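A minimal sketch of the renamed file on B, assuming the standard v2 cluster config format (the title and hostname here are placeholders; copy your existing owens.yml and adjust):

```yaml
# /etc/ood/clusters.d/owens-upgrade.yml on B
v2:
  metadata:
    title: "Owens (New Hardware)"   # what users see in the dashboard
  login:
    host: "owens-new.example.edu"   # placeholder login host for the new cluster
  job:
    adapter: "slurm"
    bin: "/usr/bin"                 # path to the Slurm client binaries
```

Your interactive apps on the B side would then need cluster: "owens-upgrade" (or a list of both clusters) in their form.yml.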
You could use different ondemand_portal settings. This defaults to ondemand, which is why the per-user state lands in ~/ondemand. If you set B’s ondemand_portal to, say, owens-new, then B would start creating directories under ~/owens-new instead. This may get you into other trouble with data sharing, though: for example, Job Composer data (jobs, templates, and so on) won’t be shared between the two instances, which could be unexpected for your users. See the nginx_stage.yml page in the Open OnDemand documentation for the full list of options.
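On B that would look something like the following in /etc/ood/config/nginx_stage.yml (owens-new is just the example name from above):

```yaml
# /etc/ood/config/nginx_stage.yml on B
# Per-user OnDemand state on B would then live under ~/owens-new,
# leaving ~/ondemand to the old instance on A.
ondemand_portal: "owens-new"
```

Users should then get a fresh ~/owens-new tree on B, while A keeps using ~/ondemand untouched.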