OOD desktop doesn't work on Rocky 8.7

Hello,

I am testing the OOD desktop with a Rocky 8.7 compute node. However, the OOD desktop keeps failing to start MATE on the node.
I followed the MATE instructions from Rocky (“MATE Desktop - Documentation”).

Here is the error log.

Setting VNC password…
Starting VNC server…

Desktop ‘TurboVNC: udc-aj36-24:1 (gp4r)’ started on display udc-aj36-24:1

Log file is vnc.log
Successfully started VNC server on udc-aj36-24:5901…
Script starting…
Starting websocket server…
The system default contains no modules
(env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
No changes in loaded modules

Launching desktop ‘mate’…
cat: /etc/xdg/autostart/gnome-keyring-gpg.desktop: No such file or directory
cat: /etc/xdg/autostart/mate-volume-control-applet.desktop: No such file or directory
cat: /etc/xdg/autostart/pulseaudio.desktop: No such file or directory
cat: /etc/xdg/autostart/rhsm-icon.desktop: No such file or directory
cat: /etc/xdg/autostart/spice-vdagent.desktop: No such file or directory
cat: /etc/xdg/autostart/xfce4-power-manager.desktop: No such file or directory
WebSocket server settings:

  • Listen on :9584
  • No SSL/TLS support (no cert file)
  • Backgrounding (daemon)
    Scanning VNC log file for user authentications…
    Generating connection YAML file…
    mate-session[124600]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/rhsm-icon.desktop: Key file does not start with a group
    mate-session[124600]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x55d5e34fb8b0 finalized while still in-construction
    mate-session[124600]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
    mate-session[124600]: WARNING: could not read /home/gp4r/.config/autostart/rhsm-icon.desktop
    mate-session[124600]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/spice-vdagent.desktop: Key file does not start with a group
    mate-session[124600]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x55d5e34fb970 finalized while still in-construction
    mate-session[124600]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
    mate-session[124600]: WARNING: could not read /home/gp4r/.config/autostart/spice-vdagent.desktop
    mate-session[124600]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/pulseaudio.desktop: Key file does not start with a group
    mate-session[124600]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x55d5e34fbaf0 finalized while still in-construction
    mate-session[124600]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
    mate-session[124600]: WARNING: could not read /home/gp4r/.config/autostart/pulseaudio.desktop
    mate-session[124600]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/gnome-keyring-gpg.desktop: Key file does not start with a group
    mate-session[124600]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x55d5e34fbbb0 finalized while still in-construction
    mate-session[124600]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
    mate-session[124600]: WARNING: could not read /home/gp4r/.config/autostart/gnome-keyring-gpg.desktop
    mate-session[124600]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/mate-volume-control-applet.desktop: Key file does not start with a group
    mate-session[124600]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x55d5e34fbbb0 finalized while still in-construction
    mate-session[124600]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
    mate-session[124600]: WARNING: could not read /home/gp4r/.config/autostart/mate-volume-control-applet.desktop
    mate-session[124600]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/xfce4-power-manager.desktop: Key file does not start with a group
    mate-session[124600]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x55d5e34fbaf0 finalized while still in-construction
    mate-session[124600]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
    mate-session[124600]: WARNING: could not read /home/gp4r/.config/autostart/xfce4-power-manager.desktop
    ALSA lib pulse.c:243:(pulse_connect) PulseAudio: Unable to connect: Connection refused

** (mate-settings-daemon:124638): WARNING **: 14:12:59.993: Could not open RFKILL control device, please verify your installation

ERROR: NVIDIA driver is not loaded

gnome-keyring-daemon: insufficient process capabilities, insecure memory might get used
** Message: 14:13:00.721: couldn’t access control socket: /tmp/928741/keyring/control: No such file or directory
gnome-keyring-daemon: insufficient process capabilities, insecure memory might get used
gnome-keyring-daemon: insufficient process capabilities, insecure memory might get used

(mate-power-manager:124682): PowerManager-ERROR **: 14:13:00.881: Error in dbus - GDBus.Error:org.freedesktop.DBus.Error.AccessDenied: Permission denied
SSH_AUTH_SOCK=/tmp/928741/keyring/ssh

(polkit-mate-authentication-agent-1:124708): polkit-mate-1-WARNING **: 14:13:00.907: Unable to determine the session we are in: No session for pid 124708
Initializing caja-image-converter extension
Initializing caja-xattr-tags extension
Initializing caja-sendto extension
Initializing caja-open-terminal extension
Initializing caja-wallpaper extension
SELinux Troubleshooter: Applet requires SELinux be enabled to run
mate-session[124600]: WARNING: Detected that screensaver has left the bus
abrt-applet: Problem connecting to dbus

(mate-settings-daemon:124638): GLib-GObject-CRITICAL **: 14:13:06.832: g_object_unref: assertion ‘G_IS_OBJECT (object)’ failed

(mate-settings-daemon:124638): GLib-GObject-CRITICAL **: 14:13:06.832: g_object_unref: assertion ‘G_IS_OBJECT (object)’ failed
[1680804780,000,xklavier.c:xkl_engine_start_listen/] The backend does not require manual layout management - but it is provided by the application
Desktop ‘mate’ ended…
Cleaning up…
Killing Xvnc process ID 124533

I would appreciate it if you could help me resolve this issue.

Thank you

What version of OOD are you running?

The SELinux lines in the log are jumping out at me. Is SELinux enabled on your system?

OOD 3.0 on Rocky 8.7

I saw the message, but SELinux is disabled on both the OOD server and the compute node.

SELinux status:                 disabled

Could you post the submit/*.yml you are using for the desktop along with the form.yml? I’m thinking SELinux is a dead end, and I want to see what is being passed back as this launches.

We don’t have a submit folder.

[root@ood-dev uva_desktop]# ls
CHANGELOG.md form.yml LICENSE.txt manifest.yml submit.yml.erb template

[root@ood-dev uva_desktop]# cat form.yml
# Batch Connect app configuration file
#
# @note Used to define the submitted cluster, title, description, and
#   hard-coded/user-defined attributes that make up this Batch Connect app.
---

# **MUST** set cluster id here that matches cluster configuration file located
# under /etc/ood/config/clusters.d/*.yml
# @example Use the Owens cluster at Ohio Supercomputer Center
#     cluster: "owens"
cluster: "rivanna"

form:
  - desktop
  - bc_vnc_idle
  - bc_vnc_resolution
  - node_type
  - bc_account
  - bc_num_hours
  - num_cores
  - num_memory
  - gpu_type
  - num_gpu
  - option
  - optional_group


attributes:
  desktop: "mate"
  bc_vnc_idle: 0
  bc_vnc_resolution:
    required: true
  num_cores:
    widget: "number_field"
    label: "Number of cores"
    value: 1
    min: 1
    max: 40
    step: 1
  num_memory:
    widget: "number_field"
    label: "Memory Request in GB ( maximum 256G )"
    value: 6
    min: 6
    max: 256
    step: 1
  bc_account:
    label: "Required: Allocation Name"
  node_type:
    widget: select
    label: "Rivanna Partition"
    options:
      - [ "Standard",     "standard"     ]
      - [ "GPU",     "gpu"     ]
      - [ "BII",     "bii"     ]
      - [ "BII-GPU",     "bii-gpu"     ]
      - [ "Dev",     "dev"     ]
      - [ "Instructional",  "instructional"  ]
      - [ "CHASE",     "pcore"     ]

    help: |
      - **Standard** - (*1-40 cores*) Rivanna node in the standard partition.
      - **Bii,Bii-gpu** - (*1-40 cores*) Rivanna partition for Biocomplexity Institute and Initiative.
      - **GPU** - (*1-28 cores*) Rivanna node that has NVIDIA GPU.
      - **Dev** - (*1-8 cores*) For short sessions (= 1 hour) with no
        SU charge; walltime is strictly limited to an hour.
      - **Learn More** - [Rivanna Queuing Policies]

      [Rivanna Queuing Policies]: https://www.rc.virginia.edu/userinfo/rivanna/queues/

  gpu_type:
    widget: select
    label: "Optional: GPU type for GPU partition"
    options:
      - [ "default",     "--gres=gpu:"     ]
      - [ "NVIDIA A100",     "--gres=gpu:a100:"     ]
      - [ "NVIDIA V100",     "--gres=gpu:v100:"     ]
      - [ "NVIDIA P100",     "--gres=gpu:p100:"     ]
      - [ "NVIDIA K80",     "--gres=gpu:k80:"     ]
      - [ "NVIDIA RTX2080",     "--gres=gpu:rtx2080:"     ]
  num_gpu:
    widget: "number_field"
    label: "Optional: Number of GPUs ( 1 ~ 4)"
    value: 1
    min: 1
    max: 4
    step: 1
  option:
    label: "Optional: Slurm Option"
  optional_group:
    label: "Optional: Group (for access to software or storage)"

The submit.yml.erb would be the file if there’s no submit folder.

Here is the submit.yml.erb

---
batch_connect:
  template: vnc
script:
  queue_name: "<%= node_type %>"
  native:
    - "-J"
    - "ood_desktop"
    - "-N"
    - "1"
    - "--cpus-per-task"
    - "<%= num_cores %>"
    - "--mem"
    - "<%= num_memory %>G"
    - "--output=desktop_open_ondemand.log"
    <%- unless option == "" -%>
    - "<%= option %>"
    <%- end -%>

I tested it in 4 different environments. Each one submitted a job to the same host (CentOS 7.9).

The results are:

  1. OOD 1.6 on CentOS 7.8 - Success
  2. OOD 2.0.32 on CentOS 7.9 - Success
  3. OOD 3.0 on CentOS 7.8 - Failed to connect to server
  4. OOD 3.0 on Rocky 8.7 - Failed to connect to server

Did something change in 3.0 in how it connects to VNC?

No, there shouldn’t be a change, but I’m wondering if enforcement around an OOD convention is causing the issue.

If you look at the docs, anything in the root directory of the app will be treated as a form.yml, so OOD thinks it is an app. To ensure it reads the submit.yml and your_cluster.yml files correctly, you need to create a directory in the app’s root, such as submit/, and put the submit.yml.erb there, while also adding the following to the app’s form.yml:

...
submit: "submit/my_submit.yml.erb"
...

Which you can see here:
https://osc.github.io/ood-documentation/latest/enable-desktops/custom-job-submission.html
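As a rough sketch of that move (not an official OOD tool), the helper below creates submit/ inside an app directory and moves a top-level submit.yml.erb into it. The function name move_submit is made up for illustration, and the directory is passed in as an argument because which directory applies depends on your install.

```shell
#!/usr/bin/env bash
# Hypothetical helper: relocate submit.yml.erb from an app's root into submit/.
# The app directory is an argument; nothing here is specific to OOD itself.
move_submit() {
  local app_dir="$1"
  mkdir -p "${app_dir}/submit"
  # Only move the file if it is still sitting in the app root.
  if [ -f "${app_dir}/submit.yml.erb" ]; then
    mv "${app_dir}/submit.yml.erb" "${app_dir}/submit/submit.yml.erb"
  fi
  # List what ended up in submit/ so the result is easy to eyeball.
  (cd "${app_dir}" && ls submit/)
}
```

You would call it as `move_submit /path/to/your/app` and then point the form's `submit:` key at the relocated file, for example `submit: "submit/submit.yml.erb"`.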

Then, for the cluster to work right, use the docs here:
https://osc.github.io/ood-documentation/latest/enable-desktops/add-cluster.html

You will want a my_cluster.yml in the clusters.d/ dir you make, and then make sure the name of that file matches what you add to the form.yml:

...
cluster: "my_cluster"
...
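If it helps, here is a small sketch for checking that match by hand. It assumes the form file has a top-level `cluster:` key with a single value, as in the form.yml posted above; the function name cluster_file_for is hypothetical and both paths are passed in as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical check: does the cluster id in a form file have a matching
# <id>.yml in the clusters.d directory?
cluster_file_for() {
  local form_file="$1" clusters_dir="$2" id
  # Take the value of the first top-level `cluster:` key, dropping quotes.
  # Commented-out examples such as `#     cluster: "owens"` are not matched.
  id="$(sed -n 's/^cluster:[[:space:]]*"\{0,1\}\([^"[:space:]]*\)"\{0,1\}.*$/\1/p' "$form_file" | head -n 1)"
  if [ -n "$id" ] && [ -f "${clusters_dir}/${id}.yml" ]; then
    echo "${clusters_dir}/${id}.yml"
  else
    echo "no cluster file for '${id}' in ${clusters_dir}" >&2
    return 1
  fi
}
```

For example, `cluster_file_for form.yml /etc/ood/config/clusters.d` prints the matching cluster file, or returns non-zero with a message if there is none.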

Once you have that all setup, what happens at submit for the job?

It may be that previous versions did not enforce this file structure as strictly, which would explain why they worked, but there should be this separation of submits, forms, and cluster files to help make sense of what the various yml files do.

Thanks for the suggestions.
I created “/etc/ood/config/apps/uva_desktop/rivanna.yml”

---
title: "Desktop"
cluster: "rivanna"
submit: "submit/submit.yml.erb"

and moved “submit.yml.erb” into the submit folder.
However, I got this Slurm error: “sbatch: error: Batch job submission failed: No partition specified or system default partition”.

I moved “submit.yml.erb” back to the root directory and I was able to submit a job.

But VNC still does not connect.

You will want to leave that file in the submit folder or OOD will not recognize it as the submit. That is called out in the Danger banner in the docs:
https://osc.github.io/ood-documentation/latest/enable-desktops/custom-job-submission.html

Ensure you have the rivanna.yml cluster file in the /etc/ood/config/clusters.d/ directory.

Your app directory should have the following form at the end:

uva_desktop/rivanna.yml
uva_desktop/submit/submit.yml.erb

And the uva_desktop/rivanna.yml should have:

---
title: "Desktop"
cluster: "rivanna"
submit: "submit/submit.yml.erb"

And is that the point at which you get the “sbatch: error: Batch job submission failed: No partition specified or system default partition” Slurm error?

I misunderstood and created the submit folder in /var/www/ood/apps/sys/uva_desktop.

I moved the submit folder into /etc/ood/config/apps/uva_desktop/ and I have no issue submitting a job.

However, VNC still doesn’t work.

Ok, are you still getting the “sbatch: error: Batch job submission failed: No partition specified or system default partition” Slurm error message at this point, or something else now?

No more Slurm error messages. Only the VNC connection is an issue.

Could you post the contents of the output.log and vnc.log from the session?

Setting VNC password...
Starting VNC server...

WARNING: udc-ba27-18:9 is taken because of /tmp/.X9-lock
Remove this file if there is no X server udc-ba27-18:9
Killing Xvnc process ID 31149
Xvnc process ID 31149 already killed
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X9
Xvnc did not appear to shut down cleanly. Removing /tmp/.X9-lock

WARNING: udc-ba27-18:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server udc-ba27-18:1

WARNING: udc-ba27-18:2 is taken because of /tmp/.X2-lock
Remove this file if there is no X server udc-ba27-18:2

WARNING: udc-ba27-18:3 is taken because of /tmp/.X3-lock
Remove this file if there is no X server udc-ba27-18:3

WARNING: udc-ba27-18:4 is taken because of /tmp/.X4-lock
Remove this file if there is no X server udc-ba27-18:4

WARNING: udc-ba27-18:5 is taken because of /tmp/.X5-lock
Remove this file if there is no X server udc-ba27-18:5

WARNING: udc-ba27-18:6 is taken because of /tmp/.X6-lock
Remove this file if there is no X server udc-ba27-18:6

WARNING: udc-ba27-18:7 is taken because of /tmp/.X7-lock
Remove this file if there is no X server udc-ba27-18:7

WARNING: udc-ba27-18:8 is taken because of /tmp/.X8-lock
Remove this file if there is no X server udc-ba27-18:8

Desktop 'TurboVNC: udc-ba27-18:9 (gp4r)' started on display udc-ba27-18:9

Log file is vnc.log
Successfully started VNC server on udc-ba27-18:5909...
Script starting...
Starting websocket server...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'mate'...
cat: /etc/xdg/autostart/gnome-keyring-gpg.desktop: No such file or directory
cat: /etc/xdg/autostart/pulseaudio.desktop: No such file or directory
cat: /etc/xdg/autostart/rhsm-icon.desktop: No such file or directory
cat: /etc/xdg/autostart/xfce4-power-manager.desktop: No such file or directory
WebSocket server settings:
  - Listen on :10043
  - Flash security policy server
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
mate-session[8096]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/rhsm-icon.desktop: Key file does not start with a group
mate-session[8096]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x1885250 finalized while still in-construction
mate-session[8096]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[8096]: WARNING: could not read /home/gp4r/.config/autostart/rhsm-icon.desktop
mate-session[8096]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/pulseaudio.desktop: Key file does not start with a group
mate-session[8096]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x18854c0 finalized while still in-construction
mate-session[8096]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[8096]: WARNING: could not read /home/gp4r/.config/autostart/pulseaudio.desktop
mate-session[8096]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/gnome-keyring-gpg.desktop: Key file does not start with a group
mate-session[8096]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x18854c0 finalized while still in-construction
mate-session[8096]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[8096]: WARNING: could not read /home/gp4r/.config/autostart/gnome-keyring-gpg.desktop
mate-session[8096]: WARNING: Could not parse desktop file /home/gp4r/.config/autostart/xfce4-power-manager.desktop: Key file does not start with a group
mate-session[8096]: GLib-GObject-CRITICAL: object GsmAutostartApp 0x7f25b800a6a0 finalized while still in-construction
mate-session[8096]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[8096]: WARNING: could not read /home/gp4r/.config/autostart/xfce4-power-manager.desktop
SELinux Troubleshooter: Applet requires SELinux be enabled to run.

ERROR: NVIDIA driver is not loaded


ERROR: Unable to load info from any available system

*** ERROR ***
TI:09:05:27	TH:0x24e7060	FI:gpm-manager.c	FN:gpm_manager_systemd_inhibit,1784
 - Error in dbus - GDBus.Error:org.freedesktop.DBus.Error.AccessDenied: Permission denied
Traceback:
	mate-power-manager() [0x418b9f]
	mate-power-manager() [0x411220]
	/lib64/libgobject-2.0.so.0(g_type_create_instance+0x1fb) [0x7f342724260b]
	/lib64/libgobject-2.0.so.0(+0x1528d) [0x7f342722628d]
	/lib64/libgobject-2.0.so.0(g_object_new_with_properties+0x27d) [0x7f3427227b3d]
	/lib64/libgobject-2.0.so.0(g_object_new+0xc1) [0x7f3427228521]
	mate-power-manager() [0x411a22]
	mate-power-manager() [0x4080b8]
	/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f3426631555]
	mate-power-manager() [0x4083db]

(nm-applet:8198): nm-applet-WARNING **: 09:05:27.486: NetworkManager is not running
Initializing caja-image-converter extension
Initializing caja-open-terminal extension
/usr/share/system-config-printer/applet.py:44: PyGIWarning: Notify was imported without specifying a version first. Use gi.require_version('Notify', '0.7') before import to ensure that the right version gets loaded.
  from gi.repository import Notify
system-config-printer-applet: failed to start NewPrinterNotification service
system-config-printer-applet: failed to start PrinterDriversInstaller service: org.freedesktop.DBus.Error.AccessDenied: Connection ":1.99089" is not allowed to own the service "com.redhat.PrinterDriversInstaller" due to security policies in the configuration file

I have attached the log file.

Since we share the home directory, the Desktop session that was created by OOD 3.0 also showed up in OOD 2.0 and 1.6. I was able to access that existing desktop session from OOD 2.0 and OOD 1.6 without any issues.

So, it looks like you need to remove some packages for this to work, and I apologize because this isn’t documented anywhere and I can’t find it in our own configs so you could never have known this (and neither did I tbh).

You need to remove:

  • system-config-printer
  • mate-power-manager

I am being told that those will cause issues when launching. What happens if you remove those and try to connect again?
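A cautious way to go about the removal, as a sketch: check which of the two packages are actually installed on the node and print the removal command before running anything. This assumes an RPM-based node with `dnf` (as on Rocky 8; on CentOS 7 you would likely substitute `yum`), and the function name removal_command is just for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report which of the two packages are installed and print the
# removal command instead of running it. Assumes rpm/dnf are the tools in use.
PKGS="system-config-printer mate-power-manager"

removal_command() {
  # Arguments: names of installed packages to remove.
  [ "$#" -gt 0 ] || { echo "nothing to remove"; return 0; }
  echo "dnf remove -y $*"
}

installed=""
for p in $PKGS; do
  # rpm -q exits 0 only when the package is installed.
  if command -v rpm >/dev/null 2>&1 && rpm -q "$p" >/dev/null 2>&1; then
    installed="$installed $p"
  fi
done

# Word splitting on $installed is intentional: one argument per package.
removal_command $installed
```

Once the printed command looks right, run it as root on the compute node (or in whatever image build you use for the nodes), then launch a new desktop session.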