Proxy Error when starting linux_host interactive applications

We are trying to set up an interactive desktop on the OOD host node using the linux_host adaptor. After some fits and restarts, we have the desktop up and running, but have one annoyance that we cannot figure out how to resolve.

We created a new (hidden) cluster using the linux_host adaptor type, specifying the hostnames as needed and providing the singularity parameters to launch a known-good container. The ‘batch_connect’ section of the cluster definition matches what we have in other clusters to run the same container as a job on a compute node.

We created a new type of bc_desktop (called Login Node Desktop) that references the linux_host cluster definition and specifies the “vnc” template for batch_connect – and that’s it.

When selected, the appropriate page is shown to allow parameter changes (not that we have any yet, but …). When we hit the ‘Launch’ button, the session is started properly, but we are redirected to a bad URL that gives us a proxy error:

https://xx.xx.xx/pun/sys/dashboard/batch_connect/sys/bc_desktop/Login_Node_Desktop/session_contexts

When I launch other (compute node) desktops, it sends me to the interactive sessions page:

https://xx.xx.xx/pun/sys/dashboard/batch_connect/sessions

If I go to that page manually, the linux_host desktop session is shown and I can launch the vnc session by hitting the button there. The desktop session itself is fully functional.

My question is why the linux_host session does not directly send me to the interactive sessions page?

I’ve tried to follow the code, and ended up in the session_contexts_controller.rb file; but I’m not fluent enough with ruby to understand what is happening there and (if it is where the problem exists) what is causing the incorrect behavior. I’m suspicious that the session structure is missing some element that is present in the normal desktop sessions, but I don’t know what that might be.

Thanks!

Ed

Can you screenshot this error page and show us what’s being displayed? Also, any errors in /var/log/ondemand-nginx/$USER/error.log would be helpful.

I suspect there’s some sort of bug here where you’re able to launch and view the session, but something wonky is happening after you launch - because obviously the session launched.

I suspect it’s because it’s a hidden cluster. I’m guessing somehow that’s throwing all of this off. Like at one point the system accepts that it’s hidden and launches the job, but at another it doesn’t like the fact that it’s hidden and fails for some reason.

It is the standard ‘Proxy Error’ screen that comes up periodically when the system is loaded and the NGINX server doesn’t respond fast enough …

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request

Reason: Error reading from remote server

There is only the message in the error_log that references the incorrect destination:

 INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/Login_Node_Desktop/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=10848.46 view=0.00 location=https://xx.xx.xx/pun/sys/dashboard/batch_connect/sessions"
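For anyone comparing a line like this against their own logs, here is a minimal Python sketch that splits it into its key=value fields so the status, redirect target and duration are easy to read off. The helper name `parse_request_line` is hypothetical and not part of Open OnDemand. It assumes values contain no spaces or quotes, which holds for the line above but may not hold for every entry in error.log.

```python
import re

# Log line copied from the post above (hostname already redacted by the poster).
LOG_LINE = (
    'INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/'
    'Login_Node_Desktop/session_contexts format=html '
    'controller=BatchConnect::SessionContextsController action=create '
    'status=302 duration=10848.46 view=0.00 '
    'location=https://xx.xx.xx/pun/sys/dashboard/batch_connect/sessions"'
)


def parse_request_line(line):
    """Return the key=value fields of a request log line as a dict.

    Hypothetical helper: assumes no value contains whitespace or a
    double quote, which is true for the line quoted in this thread.
    """
    return dict(re.findall(r'(\w+)=([^\s"]+)', line))


fields = parse_request_line(LOG_LINE)
print(fields["status"])    # HTTP status the dashboard generated
print(fields["location"])  # where the dashboard tried to redirect the browser
# Assumption: duration is in milliseconds, as is typical for Rails request logs.
print(float(fields["duration"]) / 1000.0)
```

Read this way, the dashboard did generate a redirect to the sessions page. If the duration is in milliseconds, the request took roughly 10.8 seconds before the redirect was issued.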

Yes, the session launches and everything works as expected … hopefully this is a simple fix :)

That log line is a 302 to the sessions index which seems to be correct.

You’ll get redirected to a URL like this if the job fails to launch. Let’s say Slurm doesn’t like the account that you use to submit the job - just for example - you’d be redirected back to this page so that you can resubmit the form.

Now I know you say the job submitted successfully - but I still feel like the system doesn’t recognize that. Is there really nothing else in /var/log/ondemand-nginx/$USER/error.log? I find that very odd, because an error like ‘invalid response from an upstream server’ isn’t a timing thing like nginx/the dashboard took too long (that error would be about timeouts) - this error is related more to corrupt data. Like nginx/the dashboard disconnected in the middle of sending the response, or only sent part of the response and Apache expected more to be sent.

Now I know you say the job submitted successfully - but I still feel like the system doesn’t recognize that.

Agreed – that is what is supposed to be happening in session_contexts_controller.rb, but something isn’t detecting the active session. I can’t tell what it is looking for to make that determination … but I think that if it didn’t find it, it would put me back in the page where I started the session – like what happens when a job doesn’t launch (for compute node interactive sessions).

Is there really nothing else in /var/log/ondemand-nginx/$USER/error.log ?

There is the line that launches the desktop with the ssh command and the ‘POST’ message like I quoted above. Nothing else. Is there some debug setting I can use to increase the verbosity?

My guess is that there’s some bug in the linux_host adapter while trying to read the ssh output. I’ve found that adapter to be very hard to debug - especially on a system where I can’t replicate the behavior.

Invalid responses seem to indicate that the app crashed entirely while writing the response. So it seems to me that there should be more in the logs there. Yes, you get the 302 redirect, but what happens after that? I mean, what does the log line after that indicate?

There is nothing after that … not related to that action anyway … subsequent actions on the UI generate their entries, and I get the ‘Checking whether to disconnect long-running connections …’ message that came (in the most recent case) 5 minutes after the POST message.

I will say that once I’m back on the “my interactive sessions” page, the session is correctly identified as active, and when I shut down the desktop, it correctly identifies when that has happened and marks the session as completed. It’s just that it doesn’t recognize the session during the launch process.

Just to put a bow on this … now that we have upgraded to v3.1, the problem has disappeared, so this can be closed.