Shell app dies exactly after 60 seconds of idleness

We have been working to roll out ood on top of our existing slurm cluster.
We are now a rhel9-based shop so are using the ondemand-release-web-3.0-1.noarch.rpm repo.

Everything seemed to install easily enough, including the integration with slurm so that we can use the ssh/shell app to reach our submit node, except for one persistent issue:

No matter what we do, after exactly 60 seconds of idleness in the shell app the connection closes with “Your connection to the remote server has been terminated.” Keeping the shell app active with ‘top’ keeps it from disconnecting. We suspect this is the websocket getting unhappy. Here are the things we’ve investigated that have not helped:

  1. We first thought it was ssh, but ssh is not timing out. Our ssh ClientAlive values are set to handle idleness and we don't see any issues when idling using other clients.

More conclusively, even after the “Your connection to the remote server has been terminated.” message appears in the browser, I can see that the ssh process is still active for some time, so ssh is not dead:

zack.ra+   59985   59857  1 16:42 ?        00:00:02 Passenger RubyApp: /var/www/ood/apps/sys/dashboard (production)
zack.ra+   60052   59857  0 16:43 ?        00:00:00 Passenger NodeApp: /var/www/ood/apps/sys/shell
zack.ra+   60095   60052  0 16:43 pts/0    00:00:00 ssh access.hpc.vai.org

  2. We then figured it was an nginx timeout issue. To test this, we noted that each user gets a generated /var/lib/ondemand-nginx/config/puns/username.conf

Within that conf is the directive
include /var/lib/ondemand-nginx/config/apps/sys/*.conf;
so we added the following to /var/lib/ondemand-nginx/config/apps/sys/shell.conf:

location ~ ^/pun/sys/shell(/.*|$) {
  proxy_read_timeout 600s;
  proxy_connect_timeout 600s;
  proxy_send_timeout 600s;
  uwsgi_read_timeout 600s;
  uwsgi_connect_timeout 600s;
  uwsgi_send_timeout 600s;
...

This did not make any difference, nor did various permutations of the above timeout values. Running nginx manually with my personal config:

/opt/ood/ondemand/root/usr/sbin/nginx -c /var/lib/ondemand-nginx/config/puns/zack.ramjan.conf -T

shows that the timeout settings are being read by nginx.

  3. We tried adding a ping to the shell app in /var/www/ood/apps/sys/shell/app.js in the hope that it would keep the websocket alive:
wss.pingInterval = setInterval(() => {
	ws.ping();
},4000);

This just made the app error out after 4 seconds, i.e. we received the “Your connection to the remote server has been terminated.” message right after the ping. Admittedly, I have no idea what I’m doing here, but thought it was worth a shot.

But it also made me wonder whether, rather than an actual timeout, some event was occurring at the 60-second mark that caused a failure and killed the session.


Watching the apache and nginx user logs didn't turn up anything interesting.

Any advice greatly appreciated as we are looking forward to what ood can mean for our hpc users.

I suspect what you’re seeing requires a code change, and that there’s no setting for this.

I’ll open and/or find a ticket upstream for the same. But again, I suspect that it’s apache’s 60 second timeout and that we’re not ping/ponging to the server to keep the connection alive.
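
For illustration only, here is a minimal sketch of the kind of server-side keep-alive I mean. It assumes the shell app's sockets come from the Node `ws` library, which exposes `ping()`, `readyState` and a `close` event. The helper names `safePing` and `startKeepAlive` and the 30-second interval are made up for this example; this is not the upstream fix, and I have not verified it against the shell app itself.

```javascript
// Hypothetical helpers, not part of the OnDemand shell app.
// Numeric value of WebSocket.OPEN in both the browser API and the `ws` library.
const OPEN = 1;

// Send one ping, but only if the socket is open. Returns true if a ping
// was sent, false if the socket was not open or ping() threw.
function safePing(ws) {
  if (ws.readyState !== OPEN) {
    return false;
  }
  try {
    ws.ping();
    return true;
  } catch (err) {
    return false;
  }
}

// Ping the given socket every intervalMs milliseconds until it closes.
// The interval is an assumption: it only has to be shorter than the
// proxy's idle timeout (60 seconds in this thread).
function startKeepAlive(ws, intervalMs = 30000) {
  const timer = setInterval(() => safePing(ws), intervalMs);
  // Stop pinging once the socket is gone so the timer does not leak.
  ws.on('close', () => clearInterval(timer));
  return timer;
}
```

It would be called once per connection, for example as `startKeepAlive(ws)` from inside the websocket server's connection handler, so that each socket gets its own timer instead of one timer shared across the whole server.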


Ok thanks, we are happy to help with testing etc.

I found this ticket which I just scheduled for the 3.1 release.

I think I have a fairly simple fix that seems to be working.

Create a new conf file with the following in the apache conf.d:

cat /etc/httpd/conf.d/proxytimeouts.conf

TimeOut 600
ProxyTimeout 600
KeepAlive On
KeepAliveTimeout 600

This globally sets various apache timeout values, which seems to prevent the 60-second disconnects.

It's likely that not all of the above are needed. In my few tests, the connection lasted longer than 600 seconds before dying (it appeared to die after ~1000 seconds). I will try to narrow down which config options are actually helping after the weekend.