Ran into trouble when configuring the interactive desktop on a Slurm cluster

Hi There,

I am trying to deploy OOD 3.0 on a Slurm cluster running CentOS 8.4.

Everything works fine except the interactive desktop and interactive apps.

TurboVNC and Websockify were installed in a public path and can be started successfully.
I also installed XFCE in the same public path with the command ‘dnf --installroot=/public/software/wesee/xfce groupinstall xfce’; however, the interactive desktop does not start, and neither do interactive apps that use the XFCE desktop.
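For reference, dnf --installroot keeps the normal filesystem layout inside the target directory, so the XFCE binaries typically land under usr/bin inside that root rather than directly under bin. A quick check, using the install prefix from the command above (the exact locations are an assumption, not something verified on this cluster):

# Assumed layout: with --installroot, packages install into <root>/usr/bin, <root>/usr/lib, etc.
ls /public/software/wesee/xfce/usr/bin/xfce4-session
ls /public/software/wesee/xfce/usr/bin/dbus-launch   # dbus-launch may or may not have been pulled in by the group install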

Here’s the job output.log:

Setting VNC password…
Starting VNC server…
Killing Xvnc process ID 1144317
Xvnc process ID 1144317 already killed
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X5
Xvnc did not appear to shut down cleanly. Removing /tmp/.X5-lock

WARNING: sim3:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server sim3:1

WARNING: sim3:2 is taken because of /tmp/.X2-lock
Remove this file if there is no X server sim3:2

WARNING: sim3:3 is taken because of /tmp/.X3-lock
Remove this file if there is no X server sim3:3

WARNING: sim3:4 is taken because of /tmp/.X4-lock
Remove this file if there is no X server sim3:4

Desktop ‘TurboVNC: sim3:5 (hpctest)’ started on display sim3:5

Log file is vnc.log
Successfully started VNC server on sim3:5900…
Script starting…
Starting websocket server…
ERROR: Collection default cannot be found
Launching desktop ‘xfce’…
Failed to init libxfconf: Error spawning command line “dbus-launch --autolaunch=499c7de025e24973b45c7ee39a1c82b9 --binary-syntax --close-stderr”: Child process exited with code 1.
/public/software/wesee/websockify/usr/lib/python3.6/site-packages/websockify/websocket.py:31: UserWarning: no ‘numpy’ module, HyBi protocol will be slower
warnings.warn(“no ‘numpy’ module, HyBi protocol will be slower”)
WebSocket server settings:

  - Listen on :38144
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications…
Generating connection YAML file…
Failed to init libxfconf: Error spawning command line “dbus-launch --autolaunch=499c7de025e24973b45c7ee39a1c82b9 --binary-syntax --close-stderr”: Child process exited with code 1.
Unable to init server: Could not connect: Connection refused
xfce4-session: Cannot open display: .
Type ‘xfce4-session --help’ for usage.
Desktop ‘xfce’ ended…
Cleaning up…
/opt/gridview/slurm/spool/slurmd/job00061/slurm_script: line 25: 1148837 Terminated  while read -r line; do
    if [[ ${line} =~ “Full-control authentication enabled for” ]]; then
        change_passwd; create_yml;
    fi;
done < <(tail -f --pid=${SCRIPT_PID} “vnc.log”)

My cluster configuration file content is as follows:

v2:
  metadata:
    title: "Cluster"
  login:
    host: "10.10.10.100"
  job:
    adapter: "slurm"
    bin: "/opt/gridview/slurm/usr/bin"
    conf: "/opt/gridview/slurm/etc/slurm/slurm.conf"
  batch_connect:
    basic:
        script_wrapper: |
            module purge
            %s

    vnc:
        script_wrapper: |
            module purge
            export PATH="/public/software/wesee/TurboVNC/bin:$PATH"
            export PATH="/public/software/wesee/xfce/bin:$PATH"
            export WEBSOCKIFY_CMD="/public/software/wesee/websockify/usr/bin/websockify"
            %s
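
For reference, here is a rough way to sanity-check those exports on a compute node (only a sketch, reusing the paths from the vnc script_wrapper above):

# Run on a compute node, e.g. inside an interactive job, to confirm the
# wrapper's PATH actually resolves the required binaries there.
export PATH="/public/software/wesee/TurboVNC/bin:$PATH"
export PATH="/public/software/wesee/xfce/bin:$PATH"
which vncserver xfce4-session dbus-launch || echo "at least one required binary is not on PATH"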

Can anyone help me figure out why this problem is occurring?

The cluster consists of one control node (10.10.10.100) and two compute nodes (10.10.10.101 and 10.10.10.102). Did I miss something in the configuration file?

Also, my understanding is that TurboVNC, Websockify, and the XFCE desktop all need to be installed under a public path that is accessible to all nodes. Is that correct?

Is there any way to install these services only on the control node while still letting users submit jobs to the cluster through the control node’s GUI desktop?

This seems to be your main issue: dbus-launch is not starting correctly. This could come from any number of things, but you should be able to check journalctl on that machine (or maybe /var/log/messages) and see some error logs.
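For example (run on the compute node that hosted the job; the time window and log location are just placeholders):

# On the compute node where the desktop job ran, around the time of the failure:
journalctl --since "1 hour ago" | grep -i dbus
# or, if the node logs to syslog instead:
grep -i dbus /var/log/messages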

No, because the job executes on the compute node. That is, the job runs on the compute node, so these libraries need to be present on the compute node, because that is where the processes are running.
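A quick way to confirm what is actually visible from the compute nodes (the node IPs are from the post above; the exact binary path is an assumption based on the dnf --installroot prefix):

# From the control node, check that the shared XFCE install is reachable on each compute node.
for node in 10.10.10.101 10.10.10.102; do
    ssh "$node" 'ls /public/software/wesee/xfce/usr/bin/xfce4-session || echo "xfce4-session not found on $(hostname)"'
done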