Interactive Sessions time out and are occasionally unable to reconnect

Good Afternoon,
We are having issues with timeouts on interactive sessions: if a user lets a session sit idle for a variable period of time, the session becomes disconnected. The noVNC window shows “something went wrong”, and if I then retry the connection it prints:

“new connection has been rejected, reasons: Authentication failed”

This is a big issue for some of our most common users and we aren’t able to pin down why this is happening inconsistently.
Is there anything I can provide to help get this resolved?

I think that’s by design: the session goes stale and the password gets reset.

But you should be able to go back to the ‘Interactive Sessions’ page and click the connect button on the card for the job to reconnect. Which is to say, the same URL won’t work a second time, but you can reconnect through the card as long as the job is still running.

Okay, so the expected workflow after a disconnect is: close the extra window, then select connect again.

Correct? I’ll test it for them and see if we just need to post a FAQ.

Yes. If you truly get locked out of the session that’s a bug. But you should be able to reconnect through the card as many times as you want.

Hi Jeff, our users are experiencing the same issue; we’re still on v1.6. The timeout frequency varies a lot: sometimes it occurs every few minutes even though the user has been recently active, other times it can be okay for 45+ min. Not sure if it’s related, but the VNC timeout also seems to kill any “terminal in the browser” sessions (the ones outside the VNC session). Terminal sessions seem to work fine, however, when no VNC session is running. Based on that, I’m guessing there’s some interruption of the connection between the PUN and the browser?

Are these limits settable? Googling this turns up lots of seemingly irrelevant hits and vague ideas like “websockets are finicky”, which doesn’t help us though. :) We are connecting over a relatively slow VPN, sometimes from long distances; not sure if that’s relevant. I would expect the VNC connection to only be talking to the PUN in the data center (same room as the cluster nodes), not the actual remote client (laptop), though, right?

Hi and welcome @reklar!

There seem to be a couple things going on here.

First, the terminal has a 5-minute timeout setting, from Passenger, which we do not currently let you configure. 5 minutes is the default.

As for the second issue, VNC: VNC connections go from the client (the laptop, wherever it is) to the compute node that’s running the desktop session. They bypass the PUN once they’ve got the initial HTML from the PUN.

If you’re having to connect over long distances and through a slow VPN there may be some things we can toggle like extra retries, but I don’t know off the top of my head. So I’ll have to look more in depth for you on that.

Right, that makes more sense. We have people trying to connect to these over potentially shaky home internet, sometimes from Europe/UK to the US, and through a VPN that may also be slow. Any steps we can take to make these TurboVNC connections more robust/stable would be great.

The terminal seemed to be getting killed at the same time as the VNC session. When we killed the remote desktop Slurm job, the terminal remained stable for considerably more than 5 minutes. I suppose that could have been a coincidence?

I’m sure you guys are busy with the 2.0 release, but gentle reminder ^^^ for when you get some time. Thanks much!

This issue with sessions failing over our WAN is still occurring. We tried adding --heartbeat=30 as described in the thread “noVNC and shell session timeout after 1 minute”, but it doesn’t seem to have helped.

Any advice on how to make these websocket connections more stable/persistent would be very much appreciated.
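
One thing that may be worth ruling out when tuning --heartbeat: a heartbeat can only help if its interval is comfortably shorter than the shortest idle timeout anywhere on the path between the browser and the compute node. Below is a minimal Python sketch of that check. The hop names, the timeout values, and the 2x safety margin are all hypothetical placeholders, not measured or documented values for any particular VPN, proxy, or Open OnDemand setup.

```python
# Hypothetical idle timeouts (seconds) for each hop between the browser and
# the compute node. These are placeholders, not measured or documented values.
IDLE_TIMEOUTS_S = {"home_router": 300.0, "vpn": 60.0, "proxy": 120.0}


def tightest_hop(idle_timeouts_s):
    """Return (name, timeout) of the hop that drops idle connections soonest."""
    name = min(idle_timeouts_s, key=idle_timeouts_s.get)
    return name, idle_timeouts_s[name]


def heartbeat_is_sufficient(heartbeat_s, idle_timeouts_s, margin=2.0):
    """True if at least `margin` heartbeats fit inside the shortest idle window.

    The margin is an assumed safety factor so that one delayed or lost ping
    does not let the connection go idle long enough to be dropped.
    """
    _, shortest = tightest_hop(idle_timeouts_s)
    return heartbeat_s * margin <= shortest


if __name__ == "__main__":
    hop, timeout = tightest_hop(IDLE_TIMEOUTS_S)
    ok = heartbeat_is_sufficient(30, IDLE_TIMEOUTS_S)
    print(f"tightest hop: {hop} ({timeout:.0f}s); heartbeat=30 sufficient: {ok}")
```

If the disconnects also happen while the user is actively working, as reported earlier in this thread, an idle timeout is probably not the cause, and shortening the heartbeat interval is unlikely to help on its own.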

You can now increase the Passenger pool_idle_time (the 5-minute terminal timeout mentioned earlier) in the most recent version, 2.0.