We have a situation where our noVNC desktop (and similar interactive apps) crashes when the fourth session starts on the same compute node. The first three sessions are fine: they get a correct display and start normally, but the fourth one is apparently cursed.
To inspect that, I'd like to debug job_script_content.sh more closely in our Sandbox environment, and I wonder how I can modify the script that generates it. So far, I haven't found a relevant tip in the official docs about the job submit attributes.
There are only a couple of entry points for modifying job_script_content.sh, and even so, you're more likely to want to edit script.sh; job_script_content.sh is essentially a wrapper around it.
Beyond that, you'd have to dig deep into the source code to do so.
For example, one entry point is the wrapper in job_script_content.sh that wraps script.sh. You could configure it like so to dump the environment and enable set -x before running script.sh:
batch_connect:
  basic:
    script_wrapper: |
      env
      set -x
      %s
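To see why this works: the wrapper template gets the body of script.sh substituted for the %s placeholder, sprintf-style, so the rendered script is just your debug lines followed by the original script. Here is a minimal stand-alone sketch of that substitution (the script body below is a placeholder, not your real script.sh, and this only mirrors the rendering step rather than reproducing OnDemand's actual code):

```shell
#!/bin/bash
# Simulate the script_wrapper rendering: the %s placeholder is
# replaced by the script body, sprintf-style.
wrapper='env
set -x
%s'
script_body='echo "hello from script.sh"'

# printf performs the same %s substitution the wrapper relies on
rendered=$(printf "$wrapper" "$script_body")
echo "$rendered"
```

The rendered output is `env`, `set -x`, then the script body, which is why the env dump and command tracing land in output.log ahead of everything script.sh does.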
Or you can simply edit script.sh to run strace xfce4-session instead of plain xfce4-session.
Here's some reference material on the different entry points you can reconfigure; note vnc_args in particular. We (OSC) recently had to update things so that we always get a display value >= 10, so we wrote a little wrapper script and supply vnc_args: ":$(/opt/osc/sbin/find_display 2>/dev/null)" to force a DISPLAY value on vncserver instead of letting it try to find one.
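For reference, a rough sketch of what such a helper could look like — the real /opt/osc/sbin/find_display is an internal OSC script, so this is an assumption about its behavior, not its actual contents:

```shell
#!/bin/bash
# Hypothetical find_display-style helper (assumption: the real OSC
# script does something similar). Scan for the first X display number
# >= 10 whose lock file and UNIX socket are both absent.
first_free_display() {
  local d
  for d in $(seq 10 99); do
    # A display is considered taken if either artifact exists
    if [ ! -e "/tmp/.X${d}-lock" ] && [ ! -e "/tmp/.X11-unix/X${d}" ]; then
      echo "$d"
      return 0
    fi
  done
  return 1  # no free display found in range
}

first_free_display
```

With something like this in place, vncserver never has to probe for a free display itself; it is handed one explicitly via vnc_args.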
I followed your suggestion and added the script_wrapper attribute to submit.yml.erb, but my output.log still does not contain the dump of my env or the expected output from set -x. Here is what my submit script looks like:
Now, looking at the logs of the fourth/failing session, I see that something goes wrong in this block inside job_script_content.sh:
for i in $(seq 1 10); do
  # Clean up any old VNC sessions that weren't cleaned up before
  apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver -list | awk '/^:/{system("kill -0 "$2" 2>/dev/null || apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver -kill "$1)}'

  # for turbovnc 3.0 compatibility.
  if timeout 2 apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver --help 2>&1 | grep 'nohttpd' >/dev/null 2>&1; then
    HTTPD_OPT='-nohttpd'
  fi

  # Attempt to start VNC server
  VNC_OUT=$(apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver -log "vnc.log" -rfbauth "vnc.passwd" $HTTPD_OPT -noxstartup -geometry 1024x576 2>&1)
  VNC_PID=$(pgrep -s 0 Xvnc) # the command above daemonizes the Xvnc process
  echo "${VNC_PID}"
  echo "${VNC_OUT}"

  # Sometimes Xvnc hangs if it fails to find a working display; we
  # should kill it and try again
  kill -0 ${VNC_PID} 2>/dev/null && [[ "${VNC_OUT}" =~ "Fatal server error" ]] && kill -TERM ${VNC_PID}

  # Check that the Xvnc process is running; if not, assume it died and
  # wait some random period of time before restarting
  kill -0 ${VNC_PID} 2>/dev/null || sleep 0.$(random_number 1 9)s

  # If running, then all is well; break out of the loop
  kill -0 ${VNC_PID} 2>/dev/null && break
done
My two cents: for some reason, only in the fourth/failing session, on the first iteration of the loop above, the cleanup pipeline (apptainer exec ... vncserver -list | awk '...' || apptainer ... vncserver -kill) kicks in and terminates the session, which leads to the "vncserver not found" error in the logs.
But if my hypothesis is correct, I still wonder why this happens only for the fourth session. I also wonder whether some resource is being exhausted on the compute node, leading to this abrupt failure. If so, which limits could possibly be exhausted?
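A few generic checks could help narrow this down. Nothing below is OnDemand-specific, and the limits named are only the usual suspects for "the Nth session on a node fails" symptoms (stale X lock files colliding on displays, per-user open-file or process caps), so treat this as a diagnostic sketch to drop into the job script rather than a known fix:

```shell
#!/bin/bash
# Diagnostics to run on the failing compute node before vncserver
# starts: look for stale X display artifacts and dump the per-user
# limits that multi-session setups most often exhaust.

echo "== existing X lock files and sockets =="
ls /tmp/.X*-lock /tmp/.X11-unix/ 2>/dev/null || echo "none found"

echo "== per-user limits =="
echo "open files:    $(ulimit -n)"
echo "max processes: $(ulimit -u)"

echo "== Xvnc processes owned by this user =="
pgrep -u "$USER" -a Xvnc || echo "no Xvnc processes running"
```

Comparing this output between a healthy (first) and failing (fourth) session should show whether a display collision or a per-user limit is in play.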