Unable to get interactive desktops running

Hello,

I’m new to HPC, so forgive me if I use the wrong terminology. I’ve been trying to get Interactive Desktops running with a Slurm-backed cluster created through AWS’s ParallelCluster tool. The OOD server was made using this AWS workshop.

After getting everything running, I verified that the ParallelCluster cluster can run a basic Slurm job when submitted with sbatch from the head node (not OOD). I finally nailed down the cluster name from sacctmgr and put it in my submit.yml.erb and the cluster config file (unfortunately the AWS workshop scripts call the Slurm cluster “cluster” haha).

Here is my cluster config file cluster.yml:

---
v2:
  metadata:
    title: "cluster"
    hidden: false
  login:
    host: "ip-*-*-*"  # redacted
  job:
    adapter: "slurm"
    cluster: "cluster"
    bin: "/bin"
    bin_overrides:
      sbatch: "/etc/ood/config/bin_overrides.py"

And here is my /var/www/ood/apps/sys/bc_desktop/submit.yml.erb:

batch_connect:
  template: vnc
  websockify_cmd: "/usr/local/bin/websockify"
cluster: "cluster"

Now when creating a desktop, the GUI gives me no logs,

but I see this in the ondemand-nginx error logs:

App 456545 output: [2023-10-03 20:48:17 +0000 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 456545 output: [2023-10-03 20:48:17 +0000 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/imaging-poc/session_contexts/new format=html controller=BatchConnect::SessionContextsController action=new status=200 duration=28.44 view=17.97"
App 456545 output: [2023-10-03 20:48:19 +0000 ]  INFO "execve = [{}, \"/etc/ood/config/bin_overrides.py\", \"-D\", \"/shared/home/Admin/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/imaging-poc/output/608c5ee3-45ff-441d-96e9-0cf8122129ff\", \"-J\", \"sys/dashboard/sys/bc_desktop/imaging-poc\", \"-o\", \"/shared/home/Admin/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/imaging-poc/output/608c5ee3-45ff-441d-96e9-0cf8122129ff/output.log\", \"-p\", \"desktop\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"--parsable\", \"-M\", \"cluster\"]"
App 456545 output: [2023-10-03 20:48:20 +0000 ] ERROR "ERROR: OodCore::JobAdapterError -"
App 456545 output: [2023-10-03 20:48:20 +0000 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 456545 output: [2023-10-03 20:48:20 +0000 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/imaging-poc/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=200 duration=874.25 view=14.50"

I’m not sure where to debug this, so hopefully someone here can help. Thanks in advance!

Cheers,
Neel

To debug this I would investigate your wrapper script bin_overrides.py. You can see in the logs you’ve posted the execve command we issue.

Debug by replicating this command on the webhost as your own user. These 2 bits are important - you have to issue the command as the nonprivileged user you would have logged into the UI with and it has to be on the same machine, because well, you’re trying to replicate what OOD did and OOD issued the command on that machine.

So - hop on that machine as your user and play around with bin_overrides.py issuing the same command you see there in the logs.

/etc/ood/config/bin_overrides.py -D /shared/home/Admin/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/imaging-poc/output/608c5ee3-45ff-441d-96e9-0cf8122129ff ...
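If retyping the arguments from the log gets tedious, here is a minimal Python sketch that turns one of those execve log lines into a command you can paste into a shell. It assumes the line looks like the ones posted above (a quoted message starting with execve = followed by an array), and that the escaping in it happens to be JSON-compatible, which holds for these lines but is not guaranteed for every string Ruby can print. The function name is made up for illustration.

```python
import json
import shlex


def execve_to_command(log_line: str) -> str:
    """Rebuild a shell command from an OOD 'execve = [...]' log line.

    Assumption: the quoted message and the array inside it both parse as
    JSON, as they do for the log lines shown in this thread.
    """
    start = log_line.index('"execve = ')
    end = log_line.rindex('"') + 1
    # First pass removes the outer quoting (\" becomes ").
    message = json.loads(log_line[start:end])
    # Second pass parses the array that follows 'execve = '.
    argv = json.loads(message[len("execve = "):])
    # The leading {} is the environment hash, not an argument, so drop it.
    args = [a for a in argv if isinstance(a, str)]
    return shlex.join(args)


if __name__ == "__main__":
    sample = (
        r'App 456545 output: [2023-10-03 20:48:17 +0000 ]  INFO '
        r'"execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"'
    )
    print(execve_to_command(sample))
```

Run the resulting command on the web host as the same user you log in to the UI with, so it matches what OOD did.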

Also - given it’s your first post I should have said Hi and welcome!

So the webhost here refers to the machine running OOD? And if I’m using AD to log in to the UI, does that mean the AD user ran the command?

Weirdly, there are no runs of bin_overrides.py in the ps aux output.

Oh and hello! Thanks for the welcome and the advice!

Yes to all of that.

That’s OK - it’s a one-time command, not a daemon. It looks to be a replacement for sbatch (the Slurm command to submit a job). It gets issued then returns, so I wouldn’t expect it to still be running.

Also note that I truncated that command for my convenience; there were more parameters in the execve output that you should also supply.

Ahhh I found the ParallelCluster calls itself “imaging-poc”, while OOD knows it as “cluster”. I tried changing the name in /etc/slurm/slurm.conf in OOD but it didn’t update in the Slurm accounting DB. I think the path of least resistance might be redeploying my ParallelCluster with the name “cluster” (as much as it pains me to name it poorly, it’s just a POC)

After redeploying ParallelCluster with the name “cluster” to match what’s in OOD and updating the submit.yml.erb, I’m now running into an error saying the cluster “cluster” doesn’t exist.

"ERROR: BatchConnect::Session::ClusterNotFound - Session specifies nonexistent 'cluster' cluster id."

However, sacctmgr list clusters returns:

   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
   cluster       127.0.0.1         6817  9216         1                                                                                           normal

What’s weird is the old imaging-poc name is sticking around in the UI under Clusters, despite removing the file clusters.d/imaging-poc.yml. I added clusters.d/cluster.yml and updated the host inside, but the UI still takes me to the old imaging-poc cluster and IP address (of a terminated instance). Restarting httpd didn’t help with the stale cluster config.

Unsure if there’s a cache issue or another mismatch between cluster names.
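One way to rule a name mismatch in or out is to pull the cluster names out of that sacctmgr table and compare them with the name being passed around. Below is a minimal Python sketch that parses the default table layout pasted above; the function name is hypothetical, and it assumes the first column of every row after the dashed rule is the cluster name.

```python
def known_clusters(sacctmgr_output: str) -> list:
    """Return the names in the first column of 'sacctmgr list clusters' output.

    Assumption: the default table layout, i.e. a header row, a rule made of
    dashes, then one row per cluster with the name in the first column.
    """
    names = []
    seen_rule = False
    for line in sacctmgr_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if set(stripped) <= {"-", " "}:
            # The dashed rule separates the header from the data rows.
            seen_rule = True
            continue
        if seen_rule:
            names.append(stripped.split()[0])
    return names
```

Running this against the output from both the OOD host and the Slurm head node would show whether the two sides agree on the name.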

Yea you need to ‘restart your webserver’ under the help menu to pick up new configurations like new clusters.d files.

But yes it sounds like you’re on the right track. The name of the file - cluster in this case - will be the name that you reference in your forms.

To add just a little bit of complexity on top of it - Slurm has federation capabilities so you’re able to interact with multiple clusters through 1 config file. That’s where the -M cluster argument is coming into play. And that’s because you have the setting cluster: "cluster". That may not be incorrect, that’s just a heads up on the problem space really and how those CLI flags are being generated.

Excellent, restarting the webserver picked up the new clusters.d files. I can SSH to the Slurm head node again now, thanks.

le sigh recreated the cluster with ParallelCluster, but the cluster name is still the same despite passing in the name “cluster” so I’m getting the same cluster name mismatch. This may be asking too much, but do you know if I can just use sacctmgr to create a new cluster with the same parameters and the right name? Will this anger the slurm accounting gods?

The docs appear to indicate you can; they give the example sacctmgr create cluster tux.

https://slurm.schedmd.com/sacctmgr.html

It’s no bother to me that you ask, but I’m a little out of my depth here, because even if you can create a new ‘cluster’ you likely need to add all sorts of other stuff to it like accounts & partitions and so on.

Though I don’t actually have any idea - that’s a total off the top guess. I would warn you here not to create new problems trying to solve your existing one, the first rule of holes being: stop digging.

It seems to me that it’s easier to edit a clusters.d YAML file than to re-roll a new cluster.

That makes sense, and matches my intuition too. I like the rule of holes :smiley:

So I can’t just edit the cluster.d file, because OOD has the cluster in its Slurm accounting as “cluster”, but the actual Slurm cluster has itself as “imaging-poc” in its Slurm accounting. OOD seems to check its Slurm accounting before continuing to run the interactive desktop, so I can’t pass either name without failing validation (with the correct slurm cluster name) or failing during bin_overrides.py (with what OOD calls the slurm cluster).

So as I indicated here - there are 2 different cluster names circling this discussion. One is the name you’ve given the cluster in OOD; the other is the name Slurm knows it by. OOD doesn’t do any real validation: it’s really only doing a string comparison.

So if you had a file called clusters.d/my_cool_cluster.yml and the form

cluster: 'my_cool_cluster'

It’s only validating that the string in the form matches a YAML file in clusters.d - at this point OOD does not interact with Slurm in any way - this is completely internal to OOD and indeed doesn’t really matter to Slurm.
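To make that concrete, here is a rough Python sketch of the kind of check being described. It is not OOD’s actual code (OOD is written in Ruby); the function names are made up, and it assumes only .yml files in the directory count. The cluster ID is just the file name without its extension, and the form’s cluster value has to equal one of those IDs.

```python
from pathlib import Path


def cluster_ids(clusters_dir) -> list:
    """List cluster IDs: one per .yml file, named after the file."""
    return sorted(p.stem for p in Path(clusters_dir).glob("*.yml"))


def form_cluster_is_known(form_cluster: str, clusters_dir) -> bool:
    """Pure string comparison; Slurm is never consulted here."""
    return form_cluster in cluster_ids(clusters_dir)
```

With clusters.d/my_cool_cluster.yml on disk, form_cluster_is_known("my_cool_cluster", ...) passes and any other string fails, regardless of what Slurm calls the cluster.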

Try this configuration - Again the name of the file only matters to OOD, it’s completely internal to it. With this it’ll pass the -M imaging-poc flag to sbatch.

---
v2:
  metadata:
    title: "cluster"
    hidden: false
  login:
    host: "ip-*-*-*"  # redacted
  job:
    adapter: "slurm"
    cluster: "imaging-poc"
    bin: "/bin"
    bin_overrides:
      sbatch: "/etc/ood/config/bin_overrides.py"

OR - get rid of the cluster: "imaging-poc" setting altogether. If the cluster name is set in slurm.conf, sbatch may just recognize it and it’ll all just work out.
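As a rough model of how those two options change the command line (going only by the execve lines in the logs, not OOD’s real Ruby code), the sketch below appends -M and the name when the job section has a cluster value and leaves the arguments alone when it doesn’t. The function name is hypothetical.

```python
def scheduler_args(base_args, cluster=None) -> list:
    """Append '-M <cluster>' only when a cluster name is configured."""
    args = list(base_args)
    if cluster:
        args += ["-M", cluster]
    return args


# With the setting: the Slurm command is pointed at a named cluster.
with_cluster = scheduler_args(["--export", "NONE", "--parsable"], "imaging-poc")

# Without it: no -M flag, so Slurm falls back to its own configuration.
without_cluster = scheduler_args(["--export", "NONE", "--parsable"])
```

Either way the name after -M has to be one that Slurm itself recognizes; the clusters.d file name plays no part in it.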

Ok, I tried that cluster config file, and it appears to be launching an EC2 instance. It looks like we get past the sbatch command, but now squeue might be failing with a similar issue.

/var/log/ondemand-nginx/Admin/error.log:

App 554822 output: [2023-10-04 19:19:01 +0000 ]  INFO "execve = [{}, \"/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"4\", \"-M\", \"imaging-poc\"]"
App 554822 output: [2023-10-04 19:19:01 +0000 ] ERROR "squeue: error: No cluster 'imaging-poc' known by database.\nsqueue: error: 'imaging-poc' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters."
App 554822 output: [2023-10-04 19:19:01 +0000 ]  INFO "execve = [{}, \"/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"3\", \"-M\", \"imaging-poc\"]"
App 554822 output: [2023-10-04 19:19:01 +0000 ] ERROR "squeue: error: No cluster 'imaging-poc' known by database.\nsqueue: error: 'imaging-poc' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters."
App 554822 output: [2023-10-04 19:19:01 +0000 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=184.39 view=3.80"

I think this might be a red herring though, because the desktop EC2 instance starts up, although it eventually fails with the mate desktop with these logs:

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: desktop-dy-desktop-cr-1:1 (Admin)' started on display desktop-dy-desktop-cr-1:1

Log file is vnc.log
Successfully started VNC server on desktop-dy-desktop-cr-1:5901...
Script starting...
Starting websocket server...
WebSocket server settings:
  - Listen on :50479
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
cmdTrace.c(713):ERROR:104: 'restore' is an unrecognized subcommand
cmdModule.c(411):ERROR:104: 'restore' is an unrecognized subcommand
Launching desktop 'mate'...
No such schema “org.mate.screensaver”
cat: /etc/xdg/autostart/gnome-keyring-gpg.desktop: No such file or directory
cat: /etc/xdg/autostart/mate-volume-control-applet.desktop: No such file or directory
cat: /etc/xdg/autostart/pulseaudio.desktop: No such file or directory
cat: /etc/xdg/autostart/rhsm-icon.desktop: No such file or directory
cat: /etc/xdg/autostart/spice-vdagent.desktop: No such file or directory
cat: /etc/xdg/autostart/xfce4-power-manager.desktop: No such file or directory
mate-session[6055]: WARNING: Could not parse desktop file /shared/home/Admin/.config/autostart/rhsm-icon.desktop: Key file does not start with a group
mate-session[6055]: GLib-GObject-CRITICAL: object GsmAutostartApp 0xf230b0 finalized while still in-construction
mate-session[6055]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[6055]: WARNING: could not read /shared/home/Admin/.config/autostart/rhsm-icon.desktop
mate-session[6055]: WARNING: Could not parse desktop file /shared/home/Admin/.config/autostart/xfce4-power-manager.desktop: Key file does not start with a group
mate-session[6055]: GLib-GObject-CRITICAL: object GsmAutostartApp 0xf23230 finalized while still in-construction
mate-session[6055]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[6055]: WARNING: could not read /shared/home/Admin/.config/autostart/xfce4-power-manager.desktop
mate-session[6055]: WARNING: Could not parse desktop file /shared/home/Admin/.config/autostart/spice-vdagent.desktop: Key file does not start with a group
mate-session[6055]: GLib-GObject-CRITICAL: object GsmAutostartApp 0xf232f0 finalized while still in-construction
mate-session[6055]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[6055]: WARNING: could not read /shared/home/Admin/.config/autostart/spice-vdagent.desktop
mate-session[6055]: WARNING: Could not parse desktop file /shared/home/Admin/.config/autostart/mate-volume-control-applet.desktop: Key file does not start with a group
mate-session[6055]: GLib-GObject-CRITICAL: object GsmAutostartApp 0xf23530 finalized while still in-construction
mate-session[6055]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[6055]: WARNING: could not read /shared/home/Admin/.config/autostart/mate-volume-control-applet.desktop
mate-session[6055]: WARNING: Could not parse desktop file /shared/home/Admin/.config/autostart/pulseaudio.desktop: Key file does not start with a group
mate-session[6055]: GLib-GObject-CRITICAL: object GsmAutostartApp 0xf235f0 finalized while still in-construction
mate-session[6055]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[6055]: WARNING: could not read /shared/home/Admin/.config/autostart/pulseaudio.desktop
mate-session[6055]: WARNING: Could not parse desktop file /shared/home/Admin/.config/autostart/gnome-keyring-gpg.desktop: Key file does not start with a group
mate-session[6055]: GLib-GObject-CRITICAL: object GsmAutostartApp 0xf235f0 finalized while still in-construction
mate-session[6055]: GLib-GObject-CRITICAL: Custom constructor for class GsmAutostartApp returned NULL (which is invalid). Please use GInitable instead.
mate-session[6055]: WARNING: could not read /shared/home/Admin/.config/autostart/gnome-keyring-gpg.desktop
mate-session[6055]: EggSMClient-WARNING: Invalid Version string '2023.0.15487' in /etc/xdg/autostart/dcvagentlauncher.desktop
mate-session[6055]: WARNING: Unable to find provider '' of required component 'dock'
Window manager warning: Failed to load theme "Menta": Failed to find a valid file for theme Menta

Window manager warning: Failed to load theme "Simple": Failed to find a valid file for theme Simple

Window manager warning: Failed to load theme "Default": Failed to find a valid file for theme Default

Window manager warning: Failed to load theme "Emacs": Failed to find a valid file for theme Emacs

Window manager warning: Failed to load theme "Raleigh": Failed to find a valid file for theme Raleigh

mate-session[6055]: CRITICAL: gsm_systemd_set_session_idle: assertion 'session_path != NULL' failed

If I wanted to use xfce, what would I set in my desktop app config? I could be wrong, but I think the documentation only shows an example for v1. Is this the way?

v2:
  attributes:
    desktop: xfce
  ...

I really appreciate you helping me out, I’ve been working on this for almost two weeks now. You’ve been so helpful for me learning OOD and Slurm!

The v2 (or v1 for that matter) key is a clusters.d file thing - it shouldn’t be there for application forms. You can take a look at our OSC production desktop configurations here.

The MATE issue may be this?

I switched to using xfce since that’s what the team uses (although I’m sure they’d be fine with mate as well). After installing more packages, now I’m getting this error. Seems like I’m missing some package, since mate had a similar error as well.

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: desktop-dy-desktop-cr-1:1 (Admin)' started on display desktop-dy-desktop-cr-1:1

Log file is vnc.log
Successfully started VNC server on desktop-dy-desktop-cr-1:5901...
Script starting...
Starting websocket server...
WebSocket server settings:
  - Listen on :58526
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
cmdTrace.c(713):ERROR:104: 'restore' is an unrecognized subcommand
cmdModule.c(411):ERROR:104: 'restore' is an unrecognized subcommand
Launching desktop 'xfce'...

(xfce4-session:15840): xfce4-session-WARNING **: 15:15:29.720: xfsm_manager_load_session: Something wrong with /shared/home/Admin/.cache/sessions/xfce4-session-desktop-dy-desktop-cr-1:1, Does it exist? Permissions issue?

(xfsettingsd:15876): GLib-CRITICAL **: 15:15:30.139: g_str_has_prefix: assertion 'prefix != NULL' failed

(xfsettingsd:15876): GLib-GObject-CRITICAL **: 15:15:30.140: g_value_get_string: assertion 'G_VALUE_HOLDS_STRING (value)' failed

(xfsettingsd:15876): GLib-GObject-CRITICAL **: 15:15:30.141: g_value_get_string: assertion 'G_VALUE_HOLDS_STRING (value)' failed

(xfwm4:15872): GLib-CRITICAL **: 15:15:31.702: g_str_has_prefix: assertion 'prefix != NULL' failed

(xfsettingsd:15876): xfsettingsd-WARNING **: 15:15:31.716: Failed to get the _NET_NUMBER_OF_DESKTOPS property.

(xfwm4:15872): xfwm4-WARNING **: 15:15:31.741: The property '/general/double_click_distance' of type int is not supported

** (xfdesktop:15881): WARNING **: 15:15:31.784: Thumbnailer failed calling GetFlavors

(xfwm4:15872): xfwm4-WARNING **: 15:15:31.792: Error opening /dev/dri/card0: No such file or directory

(xfwm4:15872): xfwm4-WARNING **: 15:15:32.213: Another compositing manager is running on screen 0
slurmstepd: error: *** JOB 9 ON desktop-dy-desktop-cr-1 CANCELLED AT 2023-10-05T16:15:18 DUE TO TIME LIMIT ***

Does this seem familiar?

Yes - but not in a bad way. I see those errors a lot and they can be ignored AFAIK.

From what I can tell from that log - it all went OK. The job ended due to a time limit. What’s the user experience when you try to connect to it? I mean - in what way is it failing?

And while I’m wondering about the issue, I’d ask if you have these configurations set in ood_portal.yml.

rnode_uri: '/rnode'
node_uri: '/node'

Ah gotcha. Those CRITICAL log lines disappeared recently; it must’ve been something I installed. These latest logs look good.

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: desktop-dy-desktop-cr-1:2 (Admin)' started on display desktop-dy-desktop-cr-1:2

Log file is vnc.log
Successfully started VNC server on desktop-dy-desktop-cr-1:5902...
Script starting...
Starting websocket server...
cmdTrace.c(713):ERROR:104: 'restore' is an unrecognized subcommand
cmdModule.c(411):ERROR:104: 'restore' is an unrecognized subcommand
Launching desktop 'xfce'...
WebSocket server settings:
  - Listen on :40514
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...

(xfce4-session:18207): xfce4-session-WARNING **: 15:49:49.909: xfsm_manager_load_session: Something wrong with /shared/home/Admin/.cache/sessions/xfce4-session-desktop-dy-desktop-cr-1:2, Does it exist? Permissions issue?

(xfwm4:18239): xfwm4-WARNING **: 15:49:49.992: Error opening /dev/dri/card0: No such file or directory

** (xfdesktop:18248): WARNING **: 15:49:50.211: Thumbnailer failed calling GetFlavors

I guess my issue now is that OOD thinks the interactive desktop is erroring out and I can’t connect to it.
Does a button pop up on this screen to connect, or am I just missing an obvious way to connect to the VNC server?