We have been working to roll out OOD on top of our existing Slurm cluster.
We are now a RHEL 9-based shop, so we are using the ondemand-release-web-3.0-1.noarch.rpm repo.
Everything seemed to install easily enough, including the integration with Slurm, so we can use the ssh/shell app to reach our submit node. There is one persistent issue, though:
No matter what we do, after exactly 60 seconds of idleness in the shell app, the connection closes with “Your connection to the remote server has been terminated.” Keeping the shell app active with ‘top’ keeps it from disconnecting. We suspect this is the websocket getting unhappy. Here are the things we’ve investigated that have not helped:
- We first thought it was ssh, but ssh is not timing out. Our ssh ClientAlive values are set to handle idleness, and we don’t see any issues when idling with other clients.
More conclusively, even after “Your connection to the remote server has been terminated.” appears in the browser, I can see that the ssh process is still active for some time, so ssh is not dead:
zack.ra+ 59985 59857 1 16:42 ? 00:00:02 Passenger RubyApp: /var/www/ood/apps/sys/dashboard (production)
zack.ra+ 60052 59857 0 16:43 ? 00:00:00 Passenger NodeApp: /var/www/ood/apps/sys/shell
zack.ra+ 60095 60052 0 16:43 pts/0 00:00:00 ssh access.hpc.vai.org
- We then figured it was an nginx timeout issue. To test this, we noted that each user gets a generated /var/lib/ondemand-nginx/config/puns/username.conf.
Within that conf is an
include /var/lib/ondemand-nginx/config/apps/sys/*.conf;
so we added the following to /var/lib/ondemand-nginx/config/apps/sys/shell.conf:
location ~ ^/pun/sys/shell(/.*|$) {
proxy_read_timeout 600s;
proxy_connect_timeout 600s;
proxy_send_timeout 600s;
uwsgi_read_timeout 600s;
uwsgi_connect_timeout 600s;
uwsgi_send_timeout 600s;
...
This did not make any difference, even with various permutations of the above timeout values. Running nginx manually with my personal config:
/opt/ood/ondemand/root/usr/sbin/nginx -c /var/lib/ondemand-nginx/config/puns/zack.ramjan.conf -T
shows that the timeout settings are being read by nginx.
- We tried to add a ping to the shell app in /var/www/ood/apps/sys/shell/app.js in the hope that it would keep the websocket alive:
wss.pingInterval = setInterval(() => {
ws.ping();
},4000);
This just made the app error out after 4 seconds, i.e. we received the “Your connection to the remote server has been terminated.” message right after the ping. Admittedly, I have no idea what I’m doing here, but thought it was worth a shot.
But it also made me wonder whether, rather than an actual timeout, some event occurring at the 60-second mark was causing a failure and killing the session.
Watching the Apache and nginx user logs didn’t turn up anything interesting.
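Going back to the ping attempt: for reference, my understanding of the heartbeat pattern described in the ws package README is roughly the sketch below. It assumes `wss` is the WebSocket server object the shell app creates and that it exposes `clients`, `ping()` and `readyState` the way ws does. The names `pingOpenClients`, `startHeartbeat` and `WS_OPEN` are made up for this post, and I have not verified this against the shell app itself, so treat it as an illustration of what I was aiming for rather than a known fix.

```javascript
// Sketch only: server-side heartbeat for a `ws`-style WebSocket server.
// Assumption: `wss.clients` is the collection of connected sockets and each
// socket has `readyState` and `ping()`, as in the `ws` package.
const WS_OPEN = 1; // readyState value of an open socket

// Ping every client that is currently open; return how many were pinged.
function pingOpenClients(wss) {
  let pinged = 0;
  wss.clients.forEach((ws) => {
    if (ws.readyState === WS_OPEN) {
      ws.ping();
      pinged += 1;
    }
  });
  return pinged;
}

// Ping all open clients every `intervalMs` milliseconds and stop the timer
// when the server closes. Returns the timer so the caller can clear it.
function startHeartbeat(wss, intervalMs) {
  const timer = setInterval(() => pingOpenClients(wss), intervalMs);
  wss.on('close', () => clearInterval(timer));
  return timer;
}
```

The idea would be to call something like `startHeartbeat(wss, 30000)` once after the server is created, with an interval comfortably below the 60 seconds we are seeing. Whether a ping frame actually counts as activity for whatever is closing the connection is exactly what I don’t know.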
Any advice is greatly appreciated, as we are looking forward to what OOD can mean for our HPC users.