Interactive apps are disconnecting very frequently

Hello,

OnDemand version: 2.0.28
OS Red Hat 8.4
The interactive apps we run are actively disconnecting. You can re-launch and resume.
The issue is that it is happening very frequent and at times within minutes of launching your new tab for the remote session, you get noVNC disconnection and sometimes it could be an hour or more before it disconnects you.

I have read through posts and added --heartbeat=30 to the /opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.28/gems/ood_core-0.21.0/lib/ood_core/batch_connect/templates/vnc.rb and also set passenger_pool_idle_time: to 8 hours in the nginx_stage.yml and the problem is still happening.

I stood up an identical instance and while testing all worked as expected, and interactive sessions remained connected until the job completed. After I cut over our production to the new instance the problem started to happen again. The one correlation that kinda stands out is that in testing we only had a few users 15 or less, but in production we have 180+ unique users connected.

Thanks
Vasile

Hey, sorry for the trouble.

A few questions:

  1. Have you made any changes to the before.sh.erb
  2. For the change you made to ood_core, what was the line you inserted the --heartbeat=30 on so I can see what exactly was passed.
  3. Is there anything between the web-node and the login-nodes at all?
  4. Could you look in /var/log/ondemand-nginx/<user> when this happens to see if anything jumps out?

If you could answer those it would make trouble-shooting much easier. Thanks!

Hello,

Thanks for the help :smiley:

before.sh.erb has the following contents in it on all our apps:

# Export the module function if it exists
[[ $(type -t module) == "function" ]] && export -f module

The exact line from the ood_core that was modified:(line 140)
#{websockify_cmd} -D ${websocket} --heartbeat=30 localhost:${port}

Not sure I understand the question. Our setup is such that our login servers port forward web traffic to the OnDemand instance for the corresponding cluster, so when a user puts our cluster URL in a web browser they are taken to the OnDeman instance and if they use it in a terminal it will connect them to our login nodes via SSH.

Only thing that stood out was this:

[ N 2022-10-31 20:06:45.6880 173669/T24 Ser/Server.h:558 ]: [ServerThr.35] Shutdown finished
[ N 2022-10-31 20:06:45.6880 173669/T28 Ser/Server.h:902 ]: [ApiServer] Freed 0 spare client objects
[ N 2022-10-31 20:06:45.6881 173669/T28 Ser/Server.h:558 ]: [ApiServer] Shutdown finished
[ N 2022-10-31 20:06:45.6883 173669/T1 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 173762, application /var/www/ood/apps/sys/dashboard (production)
[ N 2022-10-31 20:06:46.7920 173669/T1 age/Cor/CoreMain.cpp:1325 ]: Passenger core shutdown finished

We have OnDeman running on 3 clusters and this is happening only on 1 of the clusters. The other clusters only have a fraction of the traffic.
Additional info, this started to happen after our summer maintenance where we went from CentOS 7.9 to Red Hat 8.4.

Thanks
Vasile

Ok, thanks for the info.

It is interesting that the OS changed, but I haven’t found much on that yet.

Are you seeing anything in the actual access.logs when this happens for the OnDemand web-node? Or anything in the error.logs might be interesting as well as it may be the web-side.

But, I’d be curious to see what is happening on the other side, and what that noVNC log has going on. I’m going to play with this a bit more to see if I can replicate and find something to clue me in, but right now I’m not sure what would be causing this.

Hello,

Thanks, I’ve set it up so that all the jobs I launch are ending up on the same node. I noticed that websockify has a --log-file= option.
If I add that to the /opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.28/gems/ood_core-0.21.0/lib/ood_core/batch_connect/templates/vnc.rb , will that change by dynamic or do I need to run through the portal generator and httpd restart?

Thanks
Vasile

You’ll have to do an httpd restart, think of it as adding an option to the configs themselves and issuing the restarts. The portal generator won’t see the change as you already have that in the code, just needs to get picked up in memory.

Adding some of this to the API might be nice. We already have one in for the heartbeat being something that can also be added for the websockets so people won’t have to mess with the core code.

Also, any luck seeing the vnc.logs on any of the sessions to see anything there?

Hello,

The vnc.log just has the following:
Starts with this:

TurboVNC Server (Xvnc) 64-bit v2.2.5 (build 1.el8)
Copyright (C) 1999-2020 The VirtualGL Project and many others (see README.txt)
Visit http://www.TurboVNC.org for more information on TurboVNC

02/11/2022 11:16:36 Using security configuration file /opt/TurboVNC/etc/turbovncserver-security.conf
02/11/2022 11:16:36 Enabled security type 'tlsvnc'
02/11/2022 11:16:36 Enabled security type 'tlsotp'
02/11/2022 11:16:36 Enabled security type 'tlsplain'
02/11/2022 11:16:36 Enabled security type 'x509vnc'
02/11/2022 11:16:36 Enabled security type 'x509otp'
02/11/2022 11:16:36 Enabled security type 'x509plain'
02/11/2022 11:16:36 Enabled security type 'vnc'
02/11/2022 11:16:36 Enabled security type 'otp'
02/11/2022 11:16:36 Enabled security type 'unixlogin'
02/11/2022 11:16:36 Enabled security type 'plain'
02/11/2022 11:16:36 Desktop name 'TurboVNC: gl3035.arc-ts.umich.edu:31 (vnegrea)' (gl3035.arc-ts.umich.edu:31)
02/11/2022 11:16:36 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
02/11/2022 11:16:36 Listening for VNC connections on TCP port 5931
02/11/2022 11:16:36   Interface 0.0.0.0
02/11/2022 11:16:36 Framebuffer: BGRX 8/8/8/8
02/11/2022 11:16:36 New desktop size: 1240 x 900
02/11/2022 11:16:36 New screen layout:
02/11/2022 11:16:36   0x00000040 (output 0x00000040): 1240x900+0+0
02/11/2022 11:16:36 Maximum clipboard transfer size: 1048576 bytes
02/11/2022 11:16:36 VNC extension running!

then repeating (due to multiple drops I suspect)

02/11/2022 13:07:21 Got connection from client 127.0.0.1
02/11/2022 13:07:21 Using protocol version 3.8
02/11/2022 13:07:21 Enabling TightVNC protocol extensions
02/11/2022 13:07:21 Advertising Tight auth cap 'VENCRYPT'
02/11/2022 13:07:21 Advertising Tight auth cap 'VNCAUTH_'
02/11/2022 13:07:21 Advertising Tight auth cap 'ULGNAUTH'
02/11/2022 13:07:21 Full-control authentication enabled for 127.0.0.1
02/11/2022 13:07:21 Pixel format for client 127.0.0.1:
02/11/2022 13:07:21   32 bpp, depth 24, little endian
02/11/2022 13:07:21   true colour: max r 255 g 255 b 255, shift r 16 g 8 b 0
02/11/2022 13:07:21   no translation needed
02/11/2022 13:07:21 Using tight encoding for client 127.0.0.1
02/11/2022 13:07:21 rfbProcessClientNormalMessage: ignoring unknown encoding -260 (fffffefc)
02/11/2022 13:07:21 Using JPEG subsampling 1, Q41 for client 127.0.0.1
02/11/2022 13:07:21 Interframe comparison enabled
02/11/2022 13:07:21 Enabling Desktop Size protocol extension for client 127.0.0.1
02/11/2022 13:07:21 Enabling LastRect protocol extension for client 127.0.0.1
02/11/2022 13:07:21 rfbProcessClientNormalMessage: ignoring unknown encoding -258 (fffffefe)
02/11/2022 13:07:21 Enabling Extended Desktop Size protocol extension for client 127.0.0.1
02/11/2022 13:07:21 rfbProcessClientNormalMessage: ignoring unknown encoding -309 (fffffecb)
02/11/2022 13:07:21 Enabling Fence protocol extension for client 127.0.0.1
02/11/2022 13:07:21 Enabling Continuous Updates protocol extension for client 127.0.0.1
02/11/2022 13:07:21 rfbProcessClientNormalMessage: ignoring unknown encoding -307 (fffffecd)
02/11/2022 13:07:21 rfbProcessClientNormalMessage: ignoring unknown encoding -1063131698 (c0a1e5ce)
02/11/2022 13:07:21 rfbProcessClientNormalMessage: ignoring unknown encoding 1464686180 (574d5664)
02/11/2022 13:07:21 Enabling full-color cursor updates for client 127.0.0.1
02/11/2022 13:07:21 Using Tight compression level 1 for client 127.0.0.1
02/11/2022 13:07:21 Using 4 threads for Tight encoding
02/11/2022 13:07:22 Continuous updates enabled
02/11/2022 13:07:22 New screen layout:
02/11/2022 13:07:22   0x00000000 (output 0x00000040): 3840x1970+0+0
02/11/2022 13:07:22 Continuous updates enabled
02/11/2022 13:07:22 Continuous updates enabled
02/11/2022 13:07:22 Continuous updates enabled
02/11/2022 13:13:48 Client 127.0.0.1 gone
02/11/2022 13:13:48 Statistics:
02/11/2022 13:13:48   key events received 0, pointer events 5
02/11/2022 13:13:48   framebuffer updates 13, rectangles 64, bytes 43168
02/11/2022 13:13:48     LastRect markers 1, bytes 12
02/11/2022 13:13:48     cursor shape updates 1, bytes 2388
02/11/2022 13:13:48     Tight rectangles 62, bytes 40768
02/11/2022 13:13:48   raw equivalent 30.333012 Mbytes, compression ratio 744.039737
02/11/2022 13:13:48 Interframe comparison disabled

Thanks
Vasile

Hello,

Not sure if this helps, but I ran an strace looking at the network connections for a websockify process and these were the last lines as the session got disconnected.

recvfrom(4, "\212\200!s\212\243", 4096, 0, NULL, NULL) = 6
sendto(4, "\211\0", 2, 0, NULL, 0)      = 2
recvfrom(4, "\212\200\251\355A\24", 4096, 0, NULL, NULL) = 6
recvfrom(4, "", 4096, 0, NULL, NULL)    = 0
shutdown(7, SHUT_RDWR)                  = 0
+++ exited with 0 +++

Thanks
Vasile

Oof, yeah I’m not sure. This is a tough one to troubleshoot with all the pieces involved, as I’m sure you well know by now.

But, something somewhere has to be logging what it is doing incorrectly. Checking the websockify and vnc logs is good, make sure to also be checking those /var/log/ondemand-nginx/<user> logs because somewhere along there it seems likely things are out of alignment. I’d even check the logs on the apache web-node for OnDemand to see if there’s anything there.

But, the OS upgrade is so strange to seem to correlate with the behavior, and so that makes this even more difficult to pin down.

I wish I had more but at the moment it is just looking across these logs to correlate something that is not right.

I am still digging into this. I applied the MPM settings I found in this thread

No luck, no improvement.
I did find segfault errors reported but they don’t seem to correlate with the session disconnects as the disconnects are happening very frequent and these segfaults are not matching the occurrences of disconnects.

messages-20221106:Nov  1 23:02:59 gl-login3 kernel: httpd[2178122]: segfault at c0 ip 00007f7671f24402 sp 00007f7638fe5d80 error 4 in libssl.so.1.1.1g[7f7671eff000+87000]
messages-20221106:Nov  4 19:50:48 gl-login3 kernel: httpd[1643509]: segfault at 18 ip 00007fbd613fc1b6 sp 00007fbcdbffb8f0 error 4 in libpthread-2.28.so[7fbd613ef000+1b000]
messages-20221113:Nov  6 11:00:52 gl-login3 kernel: httpd[436755]: segfault at 0 ip 00007fbd5f3370ed sp 00007fbd357f7250 error 4
messages-20221113:Nov  6 11:00:52 gl-login3 kernel: httpd[436756]: segfault at 0 ip 00007fbd5f3370ed sp 00007fbd34ff6250 error 4
messages-20221113:Nov  6 18:21:01 gl-login3 kernel: httpd[1456122]: segfault at 0 ip 00007fbd5f3370ed sp 00007fbcf67f9250 error 4
messages-20221113:Nov  6 18:21:01 gl-login3 kernel: httpd[1456120]: segfault at 0 ip 00007fbd5f3370ed sp 00007fbcf77fb250 error 4
messages-20221113:Nov 11 09:38:51 gl-login3 kernel: httpd[1313147]: segfault at 88 ip 00007fbd60f835f5 sp 00007fbd377fb188 error 4 in libc-2.28.so[7fbd60e26000+1bc0

Anyone else running this combination of OS and OnDemand, wondering if there is something to seeing this problem once we changed from CentOS 7.9 to Red Hat 8.4
OnDemand version: 2.0.28
OS Red Hat 8.4
Apache 2.4.37 (Red Hat Enterprise Linux)

Thanks
Vasile

In an effort to try to identify what might have changed when we upgraded from CentOS 7.9 to 8.4 I came across a few things that I wanted to get some more info on.

How is HTTP2 (mod_http2) used in regards to OnDemand or is it needed ?

I noticed that my old stage CentOS7.9 system that I loaded last December had 2.0.20 running mpm_worker as the multi process module and no mod_http2. I checked the RPM dependencies and it was not a requirement.

Under Red Hat 8.4, mod_http2.rpm is a requirement of the ondemand package install.

Apache states:

HTTP/2 Dimensioning

Enabling HTTP/2 on your Apache Server has impact on the resource consumption and if you have a busy site, you may need to consider carefully the implications.

That statement is a bit of concern, wondering if there are some issues with mod_http2 and mod_event.
Just grasping at things… but figured I share this if it helps someone see something.

Thanks
Vasile

Update (seems to be resolved):

I removed the loading of the http2 module and switched from mod_event to mod_worker. Inspiration came from sigfaults seen and evaluating how thing were under CentOS 7.9 vs Red Hat 8.4 apache.

All the sessions seem to be up and we have not seen any disconnects or sigfaults reported by apache in the last 12 hours.

Thanks
Vasile

1 Like