Debugging job_script_content.sh

Hey all

We have a situation where our noVNC desktop (and similar apps) crashes when the fourth session starts on the same compute node. The first three sessions are fine: they get a correct display and the session starts, but the fourth one is apparently cursed.

To inspect this, I would like to debug job_script_content.sh more closely in our Sandbox environment, and I wonder how I can modify the script that generates it. So far, I haven’t found a relevant tip in the official docs about the job submit attributes.

So, here is where the community shines!

With kind regards
Ehsan

There are only a couple of entry points here to modify job_script_content.sh, but even so, you’re more likely to want to edit script.sh; job_script_content.sh is more of a wrapper script.

Beyond that, you’d have to dig really deep into the source code to do so.

For example, one entry point is the wrapper in job_script_content.sh that wraps around script.sh. You could configure it like this to dump the environment and enable set -x before running script.sh:

  batch_connect:
    basic:
      script_wrapper: |
        env
        set -x
        %s

Or you can just edit script.sh to run strace xfce4-session instead of issuing xfce4-session directly.

Here’s some reference material on the different entry points you can reconfigure, notably vnc_args. We (OSC) recently had to update so that we always get a display value >= 10, so we wrote a little wrapper script and supply vnc_args: ":$(/opt/osc/sbin/find_display 2>/dev/null)" to force a DISPLAY value on vncserver instead of letting it try to find one.
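The real /opt/osc/sbin/find_display is OSC-internal, but the idea can be sketched like this (a guess at the logic, not the actual script): scan upward from display 10 and print the first number whose X lock file and socket are both absent.

```shell
#!/bin/bash
# Hypothetical find_display sketch: print the first free display >= 10.
# An X server on display :N normally owns /tmp/.X{N}-lock and the
# socket /tmp/.X11-unix/X{N}; if neither exists, :N should be free.
find_display() {
  local d
  for d in $(seq 10 99); do
    if [ ! -e "/tmp/.X${d}-lock" ] && [ ! -e "/tmp/.X11-unix/X${d}" ]; then
      echo "${d}"
      return 0
    fi
  done
  return 1
}

find_display
```

Handing that number to vncserver via vnc_args sidesteps vncserver’s own display probing entirely.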


Thanks @jeff.ohrstrom for your swift response.

I followed your suggestion and included the script_wrapper attribute in the submit.yml.erb, but my output.log still does not contain the env dump or the expected output from set -x. Here is what my submit script looks like:

batch_connect:
  before_script: |
  template: "vnc_container"
  container_path: <%= (cluster == 'genius') ? rocky8_container : rocky9_container %>
  container_bindpath: "$VSC_HOME,$VSC_DATA,<%= cluster == 'sandbox' ? ',' : '$VSC_SCRATCH,/lustre1/project:/staging/leuven,' %>/vsc-hard-mounts/leuven-apps:/apps/leuven,/var,/run"
  container_module: ""
  container_command: "apptainer"
  websockify_cmd: "/usr/bin/websockify"
  set_host: "host=$(hostname -f)"
geometry: "<%= global_vnc_resolution.blank? ? "800x600" : global_vnc_resolution %>"
  container_start_args: <%= global_bc_num_gpu_slots.to_i > 0 ? "--nv" : "" %>
idle: ""
script_wrapper: |
  echo "[DEBUG] starting"
  env
  set -x
  %s
  echo "[DEBUG] Exit code: $?"
script:
  queue_name: <%= auto_queues %>
  email_on_started: <%= bc_email_on_started %>
  accounting_id: <%= global_account %>
  native:
    <%- unless global_num_cores.blank? %>
    - "-c"
    - "<%= global_num_cores.to_i %>"
    <%- end %>
    - "-N"
    - "<%= bc_num_slots.blank? ? 1 : bc_num_slots.to_i %>"
    - "--mem-per-cpu"
    - "<%= global_num_memory.blank? ? 3400 : global_num_memory %>"
<%- unless bc_email_on_started == "0" %>
    - "--mail-user"
    - "<%= global_email %>"
    <%- end %>
    <%- unless global_reservation.blank? %>
    - "--reservation"
    - "<%= global_reservation %>"
    <%- end %>

(I guess the indentation got mangled in the code block above when I copy-pasted.)

Do you spot any flaws here?
Is the script_wrapper attribute effective when the template is vnc_container?

E.

script_wrapper should be a child of batch_connect, not a sibling of it.

Of course, you’re right. Fixed that syntax issue.

Now, looking at the logs of the fourth/failing session, I see that something goes wrong in this block inside job_script_content.sh:

for i in $(seq 1 10); do
  # Clean up any old VNC sessions that weren't cleaned before
  apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver -list | awk '/^:/{system("kill -0 "$2" 2>/dev/null || apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver -kill "$1)}'

  # for turbovnc 3.0 compatibility.
  if timeout 2 apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver --help 2>&1 | grep 'nohttpd' >/dev/null 2>&1; then
    HTTPD_OPT='-nohttpd'
  fi

  # Attempt to start VNC server
  VNC_OUT=$(apptainer exec instance://5da7cc5e-6f86-4f3a-a3fb-6b5c4366f67f vncserver -log "vnc.log" -rfbauth "vnc.passwd" $HTTPD_OPT -noxstartup -geometry 1024x576  2>&1)
  VNC_PID=$(pgrep -s 0 Xvnc) # the script above will daemonize the Xvnc process
  echo "${VNC_PID}"
  echo "${VNC_OUT}"

  # Sometimes Xvnc hangs if it fails to find a working display; we
  # should kill it and try again
  kill -0 ${VNC_PID} 2>/dev/null && [[ "${VNC_OUT}" =~ "Fatal server error" ]] && kill -TERM ${VNC_PID}

  # Check that Xvnc process is running, if not assume it died and
  # wait some random period of time before restarting
  kill -0 ${VNC_PID} 2>/dev/null || sleep 0.$(random_number 1 9)s

  # If running, then all is well and break out of loop
  kill -0 ${VNC_PID} 2>/dev/null && break
done

My two cents: for some reason, only in the fourth/failing session, during the first iteration of the loop above, the cleanup line (apptainer exec … vncserver -list | awk ‘something something’ || apptainer … kill) kicks in and terminates the session; that leads to a “vncserver not found” error in the logs.
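One way to test that hypothesis without killing anything is to replay the cleanup logic as a dry run. The sketch below feeds the awk the same shape of input that vncserver -list produces (sample data here: one live PID and one deliberately dead one) and prints which displays the cleanup would target instead of killing them:

```shell
#!/bin/bash
# Dry-run of the cleanup line: print, rather than kill, the displays
# whose recorded PID is no longer alive. The input below is sample data;
# on a real node you would pipe in `apptainer exec ... vncserver -list`.
printf '%s\n' \
  "TurboVNC sessions:" \
  ":1 $$" \
  ":2 99999999" |
awk '/^:/ {
  if (system("kill -0 " $2 " 2>/dev/null") != 0)
    print "stale display " $1 " (dead pid " $2 ")"
}'
```

If the fourth session’s first iteration reports a display as stale that is in fact in use by one of the three healthy sessions, the liveness test (kill -0 on the PID column) is the place to look.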

But if my hypothesis is correct, I still wonder why this happens only for the fourth session. I also wonder whether some resource is exhausted on the compute node, leading to this abrupt failure. If so, which limits could possibly be exhausted?
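On the “which limits” question, a few cheap, read-only checks on the node between the third and fourth session can narrow it down (a generic sketch; paths and commands are standard Linux, nothing OOD-specific):

```shell
#!/bin/bash
# Quick per-node checks for the usual suspects when an Nth VNC session
# fails where N-1 succeeded. All read-only; safe to run at any time.

# 1. Leftover display slots: dead sessions can leave lock files and
#    sockets behind, blocking reuse of those display numbers.
ls /tmp/.X*-lock /tmp/.X11-unix/ 2>/dev/null || echo "no X lock files found"

# 2. Per-user kernel limits that multi-session nodes commonly hit.
echo "max user processes: $(ulimit -u)"
echo "max open files:     $(ulimit -n)"

# 3. Xvnc processes this user already owns on the node.
echo "Xvnc processes: $(pgrep -c -u "$(id -un)" Xvnc || true)"
```

If the node runs Slurm with cgroup enforcement, the per-job pids and memory cgroup limits are also worth checking; four sessions’ worth of Xvnc, websockify, and desktop processes can add up.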