OOD apps stopped working

We had a storage cluster migration this weekend, so I took the opportunity to upgrade the kernel on my worker nodes, and now I’m getting the dreaded noVNC screen.

We are running Rocky 8 with OOD 3.1.4.
/tmp is local storage and not related to the migration.

Below is my output.log:

Setting VNC password...
Starting VNC server...

WARNING: e3-compute-0-0.tch.harvard.edu:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server e3-compute-0-0.tch.harvard.edu:1

WARNING: e3-compute-0-0.tch.harvard.edu:2 is taken because of /tmp/.X2-lock
Remove this file if there is no X server e3-compute-0-0.tch.harvard.edu:2
sh: /home/ch230108/.vnc/e3-compute-0-0.tch.harvard.edu:3.pid: Invalid argument
cat: '/home/ch230108/.vnc/e3-compute-0-0.tch.harvard.edu:3.pid': No such file or directory
Could not start Xvnc.


TurboVNC Server (Xvnc) 64-bit v3.1 (build 20231117)
Copyright (C) 1999-2023 The VirtualGL Project and many others (see README.md)
Visit http://www.TurboVNC.org for more information on TurboVNC

16/07/2024 16:13:32 Using security configuration file /etc/turbovncserver-security.conf
16/07/2024 16:13:32 Enabled security type 'tlsvnc'
16/07/2024 16:13:32 Enabled security type 'tlsotp'
16/07/2024 16:13:32 Enabled security type 'tlsplain'
16/07/2024 16:13:32 Enabled security type 'x509vnc'
16/07/2024 16:13:32 Enabled security type 'x509otp'
16/07/2024 16:13:32 Enabled security type 'x509plain'
16/07/2024 16:13:32 Enabled security type 'vnc'
16/07/2024 16:13:32 Enabled security type 'otp'
16/07/2024 16:13:32 Enabled security type 'unixlogin'
16/07/2024 16:13:32 Enabled security type 'plain'
16/07/2024 16:13:32 Desktop name 'TurboVNC: e3-compute-0-0.tch.harvard.edu:3 (ch230108)' (e3-compute-0-0.tch.harvard.edu:3)
16/07/2024 16:13:32 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
16/07/2024 16:13:32 Listening for VNC connections on TCP port 5903
16/07/2024 16:13:32   Interface 0.0.0.0
16/07/2024 16:13:32 Framebuffer: BGRX 8/8/8/8
16/07/2024 16:13:32 New desktop size: 1240 x 900
16/07/2024 16:13:32 New screen layout:
16/07/2024 16:13:32   0x00000040 (output 0x00000040): 1240x900+0+0
16/07/2024 16:13:32 Maximum clipboard transfer size: 1048576 bytes
16/07/2024 16:13:33 VNC extension running!
Successfully started VNC server on e3-compute-0-0.tch.harvard.edu:5900...
Script starting...
Starting websocket server...
[websockify]: pid: 505181 (proxying 6022 ==> localhost:5900)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'xfce'...
Failed to init libxfconf: Error spawning command line “dbus-launch --autolaunch=6f05bd93862a4212a408e551b9d9dd66 --binary-syntax --close-stderr”: Child process exited with code 1.
Failed to init libxfconf: Error spawning command line “dbus-launch --autolaunch=6f05bd93862a4212a408e551b9d9dd66 --binary-syntax --close-stderr”: Child process exited with code 1.
Unable to init server: Could not connect: Connection refused
xfce4-session: Cannot open display: .
Type 'xfce4-session --help' for usage.
Desktop 'xfce' ended with 1 status...
[websockify]: started successfully (proxying 6022 ==> localhost:5900)
Scanning VNC log file for user authentications...
Generating connection YAML file...
Cleaning up...

And here is the script.sh.erb for my bc_desktop, which I believe is the standard out-of-the-box version:

#!/usr/bin/env bash

# Change working directory to user's home directory
cd "${HOME}"

# Reset module environment (may require login shell for some HPC clusters)
module purge && module restore

# Ensure that the user's configured login shell is used
export SHELL="$(getent passwd $USER | cut -d: -f7)"

# Start up desktop
echo "Launching desktop '<%= context.desktop %>'..."
source "<%= session.staged_root.join("desktops", "#{context.desktop}.sh") %>"
echo "Desktop '<%= context.desktop %>' ended with $? status..."

turbovnc is 3.1
websockify is 0.12.0

Hey, sorry for the trouble!

It looks very similar to one of the errors seen in this previous Discourse post. For them it ended up being a permissions issue:

I saw that thread. Their /tmp wasn’t writeable. That’s why I mentioned /tmp above; ours looks good.

Ok. What happens when you run the command manually from a shell on the compute node? And what do you see when you run systemctl status dbus?

Looks good, I guess.

[root@e3-compute-0-0 ~]# systemctl status dbus
● dbus.service - D-Bus System Message Bus
   Loaded: loaded (/usr/lib/systemd/system/dbus.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2024-07-12 13:45:54 EDT; 4 days ago
     Docs: man:dbus-daemon(1)
 Main PID: 1026 (dbus-daemon)
    Tasks: 1 (limit: 48893)
   Memory: 2.0M
   CGroup: /system.slice/dbus.service
           └─1026 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only

Jul 15 14:20:08 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Reloaded configuration
Jul 15 14:20:09 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Reloaded configuration
Jul 15 14:20:09 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Reloaded configuration
Jul 15 14:20:09 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Reloaded configuration
Jul 16 01:35:15 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop>
Jul 16 01:35:15 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Successfully activated service 'org.freedesktop.timedate1'
Jul 16 08:32:02 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop>
Jul 16 08:32:02 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Successfully activated service 'org.freedesktop.hostname1'
Jul 17 02:10:17 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop>
Jul 17 02:10:17 e3-compute-0-0.tch.harvard.edu dbus-daemon[1026]: [system] Successfully activated service 'org.freedesktop.timedate1'

I think the new storage cluster is not accepting : as a character in paths. Another lab reported other VNC problems and traced them back to the colon. Our storage team is neck-deep in other issues from this migration, so we cannot confirm yet. Any comments would be appreciated.

The cause was the “:” character not being allowed in paths on the new storage. Our storage admin flipped the switch to allow it, and now everything works.
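
For anyone who lands here with the same symptom: the hint in output.log is the "Invalid argument" error on the ~/.vnc/<host>:<display>.pid path, which has a colon in the file name. Below is a rough sketch of how one might check whether a given directory accepts ":" in file names. The function name colon_ok and the probe file name are made up for illustration and are not part of OOD or TurboVNC.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of OOD or TurboVNC): returns 0 if a file
# whose name contains ':' can be created in the given directory, 1 otherwise.
colon_ok() {
  local dir="$1"
  # Probe name mimics the <host>:<display>.pid pattern seen in output.log.
  local probe="${dir}/colon-probe.$$:1.pid"

  # Try to create an empty file; the subshell keeps a failed redirection
  # from affecting the caller, and the error message is discarded.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f -- "$probe"   # clean up the probe file
    return 0
  fi
  return 1
}
```

Run as the affected user on a compute node, something like `colon_ok "$HOME/.vnc" && echo allowed || echo rejected` should show whether the home file system is the problem. Note that a failure can also mean the directory is missing or not writable, so it is worth comparing against a plain file name in the same directory before blaming the colon.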