I have a question: if I configure the OpenOnDemand web server as a backend server (e.g., to launch a local desktop session) users are disconnected and reconnected at intervals equal to the POLL_DELAY time. This behavior doesn’t occur if interactive sessions are launched on a remote server.
I’m not sure why that would be the case. We’re proxying requests to an ip, and I don’t believe there’s any difference between 127.0.0.1 and any other ip - at least not one I can imagine.
By ‘disconnected’ you mean they’re disconnected from the desktop?
No, I didn’t explain myself well. Interactive sessions remain active; the only effect is a disconnect-reconnect event in the log that repeats at the POLL_DELAY interval. This doesn’t happen with a shell.
Dec 17 18:00:39 crescof-nvi1 sshd[23136]: Authorized to lbuccip, krb5 principal lbuccip@ENEA.IT (ssh_gssapi_krb5_cmdok)
Dec 17 18:00:39 crescof-nvi1 sshd[23136]: Accepted gssapi-with-mic for lbuccip from 192.107.54.6 port 35632 ssh2
Dec 17 18:00:39 crescof-nvi1 sshd[23136]: pam_unix(sshd:session): session opened for user lbuccip by (uid=0)
Dec 17 18:00:39 crescof-nvi1 sshd[23160]: Received disconnect from 192.107.54.6 port 35632:11: disconnected by user
Dec 17 18:00:39 crescof-nvi1 sshd[23160]: Disconnected from 192.107.54.6 port 35632
Dec 17 18:00:39 crescof-nvi1 sshd[23136]: pam_unix(sshd:session): session closed for user lbuccip
Dec 17 18:00:39 crescof-nvi1 sshd[23165]: pam_exec(sshd:session): execve(/afs/enea.it/common/setup_user_pfs.sh,…) failed: Permission denied
Dec 17 18:00:39 crescof-nvi1 sshd[23136]: pam_exec(sshd:session): /afs/enea.it/common/setup_user_pfs.sh failed: exit code 13
Dec 17 18:00:39 crescof-nvi1 sshd[23136]: pam_exec(sshd:session): conversation failed
Dec 17 18:00:40 crescof-nvi1 sshd[23178]: Authorized to lbuccip, krb5 principal lbuccip@ENEA.IT (ssh_gssapi_krb5_cmdok)
Dec 17 18:00:40 crescof-nvi1 sshd[23178]: Accepted gssapi-with-mic for lbuccip from 192.107.54.6 port 35636 ssh2
Dec 17 18:00:40 crescof-nvi1 sshd[23178]: pam_unix(sshd:session): session opened for user lbuccip by (uid=0)
Dec 17 18:00:50 crescof-nvi1 sshd[23202]: Received disconnect from 192.107.54.6 port 35636:11: disconnected by user
Dec 17 18:00:50 crescof-nvi1 sshd[23202]: Disconnected from 192.107.54.6 port 35636
Dec 17 18:00:50 crescof-nvi1 sshd[23178]: pam_unix(sshd:session): session closed for user lbuccip
Dec 17 18:00:50 crescof-nvi1 sshd[23959]: pam_exec(sshd:session): execve(/afs/enea.it/common/setup_user_pfs.sh,…) failed: Permission denied
Dec 17 18:00:50 crescof-nvi1 sshd[23178]: pam_exec(sshd:session): /afs/enea.it/common/setup_user_pfs.sh failed: exit code 13
Dec 17 18:00:50 crescof-nvi1 sshd[23178]: pam_exec(sshd:session): conversation failed
Dec 17 18:02:31 crescof-nvi1 sshd[25861]: Authorized to lbuccip, krb5 principal lbuccip@ENEA.IT (ssh_gssapi_krb5_cmdok)
Dec 17 18:02:31 crescof-nvi1 sshd[25861]: Accepted gssapi-with-mic for lbuccip from 192.107.54.6 port 35814 ssh2
Oh I see. I’m not 100% sure what’s happening here. Desktops don’t keep ssh terminals open forever; they just ssh in to retrieve the status of the job (the tmux session), run one simple script, then disconnect. So any ssh connection made from OnDemand in this context (a desktop session) isn’t persistent anyhow.
The match with POLL_DELAY may just be a coincidence: it and some sshd setting may both happen to be 60 seconds.
What’s the end user behavior you’re seeing? I.e., what happens to the end user?
The end user doesn’t see any effect … the only side effect is the periodic connect-disconnect in the log. This event has the same interval as the POLL_DELAY value (changing it changes the frequency of the events!).
In summary:
an interactive session on a remote backend server doesn’t exhibit this behavior
a shell app on the same webserver (OOD) triggers only one event
an interactive app on the same webserver (OOD) triggers this event at a time interval equal to POLL_DELAY
… I don’t know if it depends on the GSSAPI-KERBEROS configuration
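One way to check whether the events really track POLL_DELAY is to measure the gaps between successive "Accepted" lines in the sshd log. The following is only a rough sketch: the function name `accepted_intervals` and the `year` argument are hypothetical (syslog timestamps carry no year, so one has to be supplied), and it assumes the classic "Mon DD HH:MM:SS" timestamp format shown in the log above.

```python
from datetime import datetime


def accepted_intervals(lines, year):
    """Return the gaps, in seconds, between successive sshd 'Accepted' lines.

    `year` is an assumption supplied by the caller, because classic syslog
    timestamps do not include it.
    """
    times = []
    for line in lines:
        if "Accepted " not in line:
            continue
        # The first three whitespace-separated fields are e.g. "Dec 17 18:00:39".
        stamp = " ".join(line.split()[:3])
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    # Pair each timestamp with the next one and take the difference.
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Feeding it the lines of the system log (for example, read from a saved excerpt) and comparing the resulting gaps against the configured POLL_DELAY, before and after changing the value, should show whether the two move together.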
I don’t think this is an issue. Here’s a list of details on what’s going on:
POLL_DELAY defines how often we poll for the status of jobs.
For real schedulers like Slurm we use squeue to do this work.
For linuxhost type schedulers, we ssh into the machine to query tmux for the status of the “job”. (job here in quotes because it’s not like a job in the Slurm scheduler)
This is what you’re seeing in the logs: the user sshing in to check the status of the “job” at the POLL_DELAY interval. It’s not a persistent ssh connection; it just sshes in to get the status, then disconnects.
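To make the pattern concrete, here is a minimal sketch of that kind of polling loop. It is not OnDemand's actual code (which is written in Ruby); the names `build_status_command` and `poll_status` are hypothetical, and using `tmux has-session` as the status query is an assumption chosen for illustration. The point is only that each poll opens a fresh ssh connection, runs one command, and exits, which would produce one open/close pair in the sshd log per POLL_DELAY.

```python
import subprocess
import time


def build_status_command(host, session_name):
    """Build an ssh command that asks tmux whether a session still exists."""
    return ["ssh", host, "tmux", "has-session", "-t", session_name]


def poll_status(host, session_name, poll_delay, max_polls,
                run=subprocess.run, sleep=time.sleep):
    """Poll the "job" status over short-lived ssh connections.

    Each iteration starts a new ssh process that exits as soon as the
    command finishes, so no connection persists between polls.
    """
    statuses = []
    for _ in range(max_polls):
        result = run(build_status_command(host, session_name),
                     capture_output=True)
        # tmux has-session exits with 0 when the session exists.
        status = "running" if result.returncode == 0 else "completed"
        statuses.append(status)
        if status == "completed":
            break
        sleep(poll_delay)
    return statuses
```

With a real scheduler such as Slurm, the equivalent status check is a local scheduler command rather than an ssh login, so nothing comparable appears in the sshd log.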
This is more or less the explanation I was given.
One more detail:
1) on the web server the jobs run under LSF
2) on the remote server the scheduler is SLURM
Thanks,
Luigi