I noticed a peculiarity now that I’ve got a second instance of OOD up and running here. Since we have shared home directories across all of our clusters, it looks like things I created from one instance show up in the other, and things get weird:
One lesson I learned is that you should not name all your cluster configs “cluster.yml”. Originally this made the Halstead jobs show up as Gilbreth, until I changed those names.
I’m also seeing something similar in the Interactive Apps:
I’m not quite sure what the best approach is for handling multiple OOD instances on a shared home directory. A clean approach could be to not show them at all: just hide them if $cluster.yml is not defined. Or make resource-specific ~/ondemand directories.
A slicker approach could be to define a dummy “cluster.yml” that just redirects you over to the other instance for that cluster. But I wouldn’t necessarily want it to be advertised. We have to be careful, since cluster access is not universal here (a solution to that would also be slick!). In theory, I believe PBS commands on one of our clusters could talk to the other; however, should we start moving to another job manager, it won’t happen all at once, so making the systems cross-talk gets difficult quickly. So I think the redirect to another instance would be the preferred approach for me.
Is there any approach off-hand for addressing these issues?
Yes, indeed there is a feature for that, but it wasn’t what I expected. The way OSC does it is through an undocumented ACL feature (GitHub Issue to add docs) in the cluster config schema.
The short of it is that your cluster config should look something like this:
title: "Definitely Not cluster.yml"
acls: # <== ACLS!
- adapter: "group"
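A fuller sketch of such a file, assuming the v2 cluster config layout. The file name and the group name are placeholders, and the `groups` and `type` keys reflect my understanding of the schema rather than anything confirmed in this thread, so verify them against the docs before relying on this:

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml   (hypothetical file name)
---
v2:
  metadata:
    title: "Definitely Not cluster.yml"
  acls:                       # <== ACLS!
    - adapter: "group"
      groups:
        - "cluster_users"     # placeholder Unix group; replace with your own
      type: "whitelist"       # assumption: only members of the listed groups see this cluster
```

With something like this in place, users outside the listed group should simply not see the cluster in the apps.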
I literally just learned that this was a thing and will be writing it up tomorrow. How I expected it to work, and a method that we will try to support in a future version (version > 1.5), is through file read permissions on the cluster config file tied to a group membership. But when I tried that, my Dashboard threw a 500 error because we do not gracefully handle read-permission errors on cluster configs, so don’t try that at home.
Ah, nifty! FWIW, we use LDAP attributes for cluster access (and use the appropriate line for that in the CAS/Apache config).
So messing around a bit, it seems like a minimal configuration like this:
And setting hidden to true seems to hide it from the Shells dropdown and the selection in the Job Composer. And then in your interactive sessions it correctly resolves the job status (rather than the “bad state”). Amazingly, I seem to be able to cross-connect to interactive sessions (I thought I had the reverse-proxy stuff ACL’d to just the cluster domain, but maybe I had it one level up for simplicity).
I think this’ll work for now. I’m pretty sure it works because each of our Torques can talk to each other, so I still worry when this may not be the case in the future. I’d hate to have to install, say, Slurm binaries alongside Torque binaries should we build a new cluster on Slurm, but I suppose that might be doable.
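A rough sketch of the kind of minimal, hidden cluster config described above, assuming the v2 layout and the Torque adapter. The title, host name, and paths are all placeholders invented for illustration, not values from either site:

```yaml
# Entry for the *other* cluster, added on this OOD instance only so job status resolves.
---
v2:
  metadata:
    title: "Other Cluster"                    # placeholder title
    hidden: true                              # hide from the Shells dropdown and Job Composer
  job:
    adapter: "torque"
    host: "batch.other-cluster.example.edu"   # placeholder batch server
    lib: "/opt/torque/lib"                    # placeholder path to the Torque libraries
    bin: "/opt/torque/bin"                    # placeholder path to the Torque binaries
```

This only helps while the local Torque client can actually reach the other cluster's batch server, which is the caveat raised above.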
I am not sure what the consequences of using the hidden attribute for access control are, but if that is good enough for you then great. We do not currently support LDAP-based ACLs for the cluster.
As for running heterogeneous clusters: OOD supports it. I have a testing environment where I have Slurm and GridEngine installed side by side.
Also I have an open PR for mentioning that group-based ACLs are a feature in OOD’s cluster configurations: https://github.com/OSC/ood-documentation/pull/136
I can explain the hidden attribute. We use that at OSC to hide the “Quick” cluster, which we run certain interactive jobs on (so users using OnDemand can submit those jobs) but do not want to expose as a “proper cluster”, i.e. it doesn’t appear as an option to submit jobs to in the Job Composer, and we use it to filter Quick out of the list of clusters shown in our custom system status app. The cluster does, however, appear in the list of clusters in Active Jobs.
Ah, there’s the gotcha! I added a couple if statements in the active jobs app so that it would honor that, and that seems to help. Now I wonder if I can do the same for the interactive apps list…
Yeah, I was able to find the right place in the dashboard to add a filter on the metadata.hidden field. I still need to do the same for the Job Composer and do more testing to make sure I didn’t goof anything else up. This should do the trick if it is working as intended.
Sorry I failed to read the main topic of discussion when I read the comment about the hidden attribute.
We haven’t documented it well, but in nginx_stage there is an OOD_PORTAL environment variable that can be set. This will actually change the default data root locations for the apps. The default value is “ondemand” but it can be changed. If you want a second OOD installation to have its own namespaced data, this is an approach you can take and what we do. We have two different OOD portals, an “AweSim” portal with the OOD_PORTAL=“awesim” and an “OSC OnDemand” portal with the default OOD_PORTAL. The result is when launching job composer from AweSim OOD, the data goes under ~/awesim/data/… instead of ~/ondemand/data…
If this approach is interesting to you I can look to provide more detailed/careful instructions for how to set that up.
Just to follow up on this, I changed the OOD_PORTAL variable during the 1.5 upgrade on our first prod instance this morning, rather than try to backport all my modifications. Seems to work well so far. Are there any considerations other than just changing the var to something unique? I meant to follow up on this earlier but it slipped my mind.
One of the difficulties of having an app support job records from two different OnDemand installs, with two different cluster configs for two different clusters, is that when an app has a job record with a cluster specified that “doesn’t exist”, there are now three possibilities for why it doesn’t exist:
1. the cluster actually doesn’t exist because it was decommissioned and removed
2. the cluster does exist, but you aren’t meant to use it from that OnDemand instance
3. the cluster does exist, but you can’t use it from that OnDemand instance
You point out correctly that your situation is #2 and the temporary fix was to add a cluster config for the other cluster - so now the status of that job can be found. But in the future you correctly fear #3:
“I’m pretty sure it works because each of our Torques can talk to each other, so I still worry when this may not be the case in the future. I’d hate to have to install, say, Slurm binaries alongside Torque binaries should we build a new cluster on Slurm, but I suppose that might be doable.”
Maybe the solution is to define a noop job adapter type. I don’t know what we called it, but for the sake of argument, imagine you are configuring your OnDemand instance that serves your Torque cluster, and you add a “noop” for your Slurm cluster:
title: "My Slurm Cluster"
Then the apps would disable interacting with a job whose cluster was a noop cluster, but still show the record. Or the apps would just hide the record altogether.
Or maybe there is a configuration option that tells the apps to treat any job whose “cluster” is not configured the way I described in my noop example: just hide the records, and don’t try to check their status. Or maybe that configuration, instead of being a boolean, is a comma-delimited string of cluster IDs.
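Spelled out in full, the imagined config might look something like the sketch below. To be clear, the “noop” adapter value is purely hypothetical and only illustrates the idea in this post; the surrounding v2 layout is assumed:

```yaml
# On the OnDemand instance that serves the Torque cluster:
---
v2:
  metadata:
    title: "My Slurm Cluster"
  job:
    adapter: "noop"   # hypothetical adapter type: apps would know the cluster exists but never query it
```

The appeal of this shape is that the Torque-facing instance would need no Slurm binaries at all; it would only need to know that the cluster ID is valid somewhere else.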