We have recently added some new nodes to our cluster and can’t figure out why noVNC won’t connect to them. The SSH connection works fine, TurboVNC IS running on the node, and the VNC session is set up and running when the user launches an interactive session.
We’re seeing the exact same output in the user’s log files on the working nodes as we are on the broken nodes, with the exception of this on the broken nodes:
ERROR: NVIDIA driver is not loaded
ERROR: Unable to load info from any available system
We believe this is a red herring - it only indicates that the NVIDIA drivers are installed but not loaded, and we aren’t running OpenGL on these nodes anyway.
We’ve changed the hostnames on these new nodes, but the regex changes we made to the ood_portal.yml file seem to work fine, as the FQDN is set up correctly in the connection URL and the SSH connection works when an interactive job is launched.
There are no errors in the logs on the OOD server itself, nor anything on the nodes. The vnc.log shows the same exact info on the working nodes and the broken nodes, right up to the connection part. It simply stops at this line on the broken nodes:
29/01/2020 09:17:00 VNC extension running!
I can’t say the new nodes are EXACTLY the same as the old nodes, as we have started a new installation and configuration setup. However, we’ve reviewed everything we have in place for OOD on the old nodes and can’t find anything different on the new nodes - websockify and TurboVNC are set up exactly the same. So we’re trying to figure out what the VNC connection is doing at the point where it fails. Are there any temp files it might be creating? Maybe we have some restrictions on writing files on these new installations that we missed. Are there other places where OOD logs that we can check? We’ve looked at /var/log/* on the OOD server, in the user’s OnDemand directory, and in the node’s log files.
Thanks for any pointers you can provide!
Dori
UB CCR
To verify, have you already looked in the working directory of the failed jobs for the job output?
The OnDemand interface is not that helpful in providing easy access to these after the job completes (fails), but while it’s queued there is a link you can click to open that directory in the Files app.
Yeah the really odd thing is that the job setup when it’s queued looks exactly like the jobs that end up working (on the older nodes). Once it starts, it even looks the same until it goes to connect to that VNC session, then it terminates.
When you say “until it goes to connect to that VNC session, then it terminates”, what exactly happens? Do you see a noVNC error? Does the batch job stop immediately after you try to connect? I would hope there was an error somewhere in the batch job’s error or output files afterwards.
Have you confirmed that non-VNC apps such as Jupyter or RStudio work?
One thing you could try, just to verify there isn’t a problem with proxying to the compute nodes of the new cluster, is to start a simple server and verify you can connect to it:
Submit a job or start an interactive job and run this command:
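For example, a minimal sketch of that kind of test, assuming Python 3 is available on the compute node and the default /node reverse-proxy path is configured in ood_portal.yml (the port 8080 is arbitrary):

```
# On the compute node, inside the job: serve the current directory on a throwaway port
python3 -m http.server 8080

# Then, from a browser, try to reach it through the OnDemand reverse proxy, e.g.:
#   https://<ondemand-host>/node/<compute-node-fqdn>/8080/
# A plain directory listing means proxying to that node works; a 404 points at the
# proxy rules (host_regex / node_uri in ood_portal.yml) or a firewall.
```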
To answer your first questions, the job is still running on the node but we get the error “failed to connect to server” and the job will just stay running like that until it’s canceled or time runs out. There’s nothing in the output files that’s any different than with the older nodes that work, except those two NVIDIA messages I included in my first email.
I meant to check a non-VNC app like Jupyter and forgot. So you’re right - none of this is working. When I try Jupyter, the session starts, but when I go to connect to it I get “404 not found”. When I try your suggestion of testing with this basic job, I get the same “404 not found” error. I don’t know what to do to fix this though. I’m sorry for this basic question, as it’s obviously not OOD related, but do you have any suggestions?
You will definitely want to fix it first for the simplest basic job. Either something is misconfigured on the compute node side (iptables?) or on the OnDemand web host side (is OnDemand trying to connect to the wrong form of the hostname?). Sometimes the string that hostname returns on the compute node does not match the host that OnDemand should be proxying to. A few quick checks along those lines are sketched below.
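These are a sketch only - the port matches the throwaway test server above and the hostnames are placeholders:

```
# On the compute node: what name does the job report? This is the host OnDemand
# will proxy to, and it has to match the host_regex in ood_portal.yml.
hostname
hostname -f

# From the OnDemand web host: can Apache actually reach that name and port?
curl -v http://<compute-node-fqdn>:8080/

# On the compute node: is a host firewall dropping the connection?
sudo iptables -L -n | grep 8080
```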
It turns out it was a regex issue: I only updated the ood_portal.yml file and didn’t run the portal generator. I’m so sorry to have bothered you with this, but I very much appreciate the trick you shared, as it definitely helped with the troubleshooting!
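For anyone who lands here with the same problem, the missing step was regenerating the Apache config after editing the YAML - roughly this (a sketch; exact paths and service names vary by OOD version and OS):

```
# Rebuild ood-portal.conf from /etc/ood/config/ood_portal.yml
sudo /opt/ood/ood-portal-generator/sbin/update_ood_portal

# Restart Apache so the new proxy/host_regex rules take effect
# (service name depends on the install, e.g. httpd24-httpd with SCL Apache on EL7)
sudo systemctl restart httpd24-httpd
```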
This is an example in OnDemand of a violation of https://12factor.net/build-release-run, where you should be able to change the config without doing a build step! @tdockendorf opened a PR to regenerate the ood-portal.conf file whenever Apache is restarted. With that in place, modifying the config and restarting Apache would have fixed this. It also seems like it would be ideal to pop a warning on the dashboard if the ood_portal.yml config changed but the Apache config does not have the corresponding changes. I’ll open an issue to address this.