Having successfully deployed an OOD 1.5.5 instance in a sandbox on a VM, I am now trying to do the same in our HPC environment. I'm having an issue with the Interactive Desktop. Specifically, the environment variable settings I have defined in my cluster configuration in /etc/ood/config/clusters.d are somehow not making their way into the slurm_script that runs the desktop on the compute node. As a result, the desktop session attempts to find websockify at the default location (/opt/websockify/run), where it is not found. So although I can see all the Mate desktop processes running, in the absence of a socket I am unable to connect.
Here is the cluster configuration in my HPC environment:
Besides the hostname, the only difference in our (working) sandbox is that python (v3) is installed directly from an RPM rather than being loaded as a module. When we start a desktop in the sandbox environment, the job_script_content.sh produced in the user’s output log starts with the lines from the cluster config:
Hi, thanks for all the details in your question! This works in your sandbox but not in your HPC (production?) environment, so that's a clue. As a quick spot check, be sure both environments are running the same version. 1.5.5 isn't that old, but it's not that new either (1.6.20 is the latest, just FYI).
Your YAML looks good. I copied it, and it parsed and loaded correctly.
But my guess is, if the libraries had read the YAML correctly, they would have written the job shell correctly (as they do in your lab sandbox). The library doesn't interpret the content in any way, so its actual content is only meaningful as it relates to YAML parsing.
Can you do a quick diff of your cluster configs between the two environments? Sometimes it may be something as simple as a YAML indentation issue. We use single lines like below. Maybe you can try that as a test instead of the |?
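As an illustrative sketch only (the module name and websockify path here are placeholders, not your actual values), a single-line `script_wrapper` in a clusters.d file could look like this, where `%s` is where OnDemand splices in the generated script body:

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml (hypothetical example)
v2:
  batch_connect:
    vnc:
      # single-line form with \n separators instead of a | block scalar
      script_wrapper: "module load python3\nexport WEBSOCKIFY_CMD=/usr/local/bin/websockify\n%s"
```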
Obviously there'll be some quoting issues, but my guess is it's something simple like a YAML parsing issue: the library can't read that portion and just discards it.
Also, it looks like you can set the websockify command directly with this parameter. Though, this will only get you half of the way there: you still won't get the other settings from the script_wrapper, which isn't being applied for you right now.
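If it helps, the parameter in question lives under the `vnc` batch_connect settings in the cluster config; a minimal sketch, assuming websockify is installed at a non-default path (the path shown is a placeholder):

```yaml
# clusters.d fragment (illustrative)
v2:
  batch_connect:
    vnc:
      websockify_cmd: "/usr/local/bin/websockify"
```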
Thank you. I did catch this error yesterday morning after a good night’s rest. Though the syntax fix alone wasn’t sufficient to get the configuration working, I have a feeling it wouldn’t be working without it.
Collapsing the lines into a single line with \n separators, along with fixing the “batch_connnect” misspelling, seems to have worked: the environment settings are now being logged, and websockify is now running on the client.
Something strange happened, though, after I made the change: I started seeing Slurm connectivity errors, and my batch jobs weren't being submitted. I had seen such errors before, so I had a general idea where to look for the problem. Our HPC (production) environment has but one large cluster, so we have never named it in our Slurm configuration; we just use the default “slurm_cluster” name. OnDemand doesn't seem to deal with this very well, at least not consistently.

I first tried removing the “cluster: slurm_cluster” definitions from both configuration files (in clusters.d and in apps/bc_desktop). After that, I was able to submit Slurm jobs (and thus start the desktop processes on my client), but I received errors when I attempted to show job status within OOD. To fix that, I had to put the “cluster: slurm_cluster” back into the configuration in apps/bc_desktop. Both features now appear to do what they are intended to do: showing job status, and starting the Mate desktop processes and websockify on the client.
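To make the end state above concrete, here is a hedged sketch (file names and values are illustrative, not the poster's actual files): no `cluster:` under `job:` in clusters.d, so the Slurm adapter doesn't pass a named cluster (via sbatch's `-M`) that a cluster-less Slurm installation would reject, while the bc_desktop config keeps `cluster:` pointing at the cluster definition so OOD can look up job status:

```yaml
# /etc/ood/config/clusters.d/slurm_cluster.yml (sketch)
v2:
  job:
    adapter: "slurm"
    # note: no "cluster:" key here
---
# /etc/ood/config/apps/bc_desktop/slurm_cluster.yml (sketch)
title: "Interactive Desktop"
cluster: "slurm_cluster"  # matches the clusters.d file name; used for status lookups
```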
Unfortunately, I still can’t connect to my desktop.
Thanks so much for all your help. Among open source user communities I’ve encountered, this is one of the best I’ve seen so far with regard to responsiveness. Politeness, too. Your assistance is greatly appreciated.
I imagine it was just the batch_connect misspelling that was the issue. If you want to move back to |, I think that'd be fine; indeed, it's a lot more readable that way. The single-line form was just a suggestion to try. The misspelling was surely the only problem.
As to the error described: do you use CAS for authentication by any chance? A Google search for that error actually turned up this thread, which contains that exact error (in the post above the one linked). Looks like they needed to add CASScope to their Apache configs.
I just discovered that if I edit the URL in my browser to use the FQDN of my client instead of the short name, noVNC is able to connect. That's a little odd, too, since DNS is configured such that either name resolves to the same IP address, but I feel like I'm on the right track.
It's working now. As mentioned in your response, I just had to change the host_regex back to the default ‘[^/]+’. I'd had it set that way before, but in my effort to get things working, I'd experimented with a few different settings and apparently never changed that one back.
Again, thanks much to both of you for your help. You guys are the best!
Glad it is working! I recommend trying to determine a host regex that captures the hosts you allow without permitting just every host. The idea of the regex is to limit requests from authenticated users through the “dumb proxy”, which just uses the host and port embedded in the URL from the user to determine which backend server to proxy to, i.e. /rnode/HOST/PORT. If changing the host_regex back to the default fixed the problem, that of course means the host_regex was too restrictive, but there may be a less restrictive one that is still preferable to allowing any host at all. See the tip, warning, and danger boxes in this section of the documentation: https://osc.github.io/ood-documentation/master/app-development/interactive/setup/enable-reverse-proxy.html#steps-to-enable-in-apache
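For example, a middle-ground host_regex in /etc/ood/config/ood_portal.yml could restrict the proxy to your compute nodes' naming scheme while matching both the short and fully qualified forms seen earlier in this thread (the node pattern and domain here are placeholders for your own):

```yaml
# /etc/ood/config/ood_portal.yml (illustrative fragment)
node_uri: '/node'
rnode_uri: '/rnode'
# match e.g. "node042" or "node042.hpc.example.edu", but not arbitrary hosts
host_regex: 'node\d+(\.hpc\.example\.edu)?'
```

After changing ood_portal.yml you'd regenerate the Apache config (with the portal generator's update_ood_portal script) and restart Apache for it to take effect.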