Desktop Session Entering a Bad State

The desktop session keeps entering a bad state. Sharing the output logs below for reference.

cat /shared/home/ram500/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/ood-pcluster/output/0f7e9e60-91fa-42a9-ad40-205e0fa413a2/output.log
Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: desktop-st-desktop-cr-1:2 (ram500)' started on display desktop-st-desktop-cr-1:2

Log file is vnc.log
Successfully started VNC server on desktop-st-desktop-cr-1.ood-pcluster.pcluster:5902...
Script starting...
Starting websocket server...
cmdTrace.c(713):ERROR:104: 'restore' is an unrecognized subcommand
cmdModule.c(411):ERROR:104: 'restore' is an unrecognized subcommand
Launching desktop 'mate'...
No such schema "org.mate.screensaver"
[websockify]: pid: 16759 (proxying 49832 ==> localhost:5902)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
mate-session[16790]: EggSMClient-WARNING: Invalid Version string '2023.1.16388' in /etc/xdg/autostart/dcvagentlauncher.desktop
mate-session[16790]: WARNING: Unable to find provider '' of required component 'dock'
[websockify]: started successfully (proxying 49832 ==> localhost:5902)
Scanning VNC log file for user authentications...
Generating connection YAML file...
mate-session[16790]: CRITICAL: gsm_systemd_set_session_idle: assertion 'session_path != NULL' failed

vnc.log

TurboVNC Server (Xvnc) 64-bit v3.1 (build 20231117)
Copyright (C) 1999-2023 The VirtualGL Project and many others (see README.md)
Visit http://www.TurboVNC.org for more information on TurboVNC

06/06/2025 14:02:01 Using security configuration file /etc/turbovncserver-security.conf
06/06/2025 14:02:01 Enabled security type 'tlsvnc'
06/06/2025 14:02:01 Enabled security type 'tlsotp'
06/06/2025 14:02:01 Enabled security type 'tlsplain'
06/06/2025 14:02:01 Enabled security type 'x509vnc'
06/06/2025 14:02:01 Enabled security type 'x509otp'
06/06/2025 14:02:01 Enabled security type 'x509plain'
06/06/2025 14:02:01 Enabled security type 'vnc'
06/06/2025 14:02:01 Enabled security type 'otp'
06/06/2025 14:02:01 Enabled security type 'unixlogin'
06/06/2025 14:02:01 Enabled security type 'plain'
06/06/2025 14:02:01 Desktop name 'TurboVNC: desktop-st-desktop-cr-1:2 (ram500)' (desktop-st-desktop-cr-1:2)
06/06/2025 14:02:01 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
06/06/2025 14:02:01 Listening for VNC connections on TCP port 5902
06/06/2025 14:02:01   Interface 0.0.0.0
06/06/2025 14:02:01 Framebuffer: BGRX 8/8/8/8
06/06/2025 14:02:01 New desktop size: 800 x 600
06/06/2025 14:02:01 New screen layout:
06/06/2025 14:02:01   0x00000040 (output 0x00000040): 800x600+0+0
06/06/2025 14:02:01 Maximum clipboard transfer size: 1048576 bytes
06/06/2025 14:02:01 VNC extension running!

websockify.log

WebSocket server settings:
  - Listen on :49832
  - No SSL/TLS support (no cert file)
  - proxying from :49832 to localhost:5902

Hi and welcome!

What do you mean by bad state? Is this a scheduler that puts it into a bad state?

Hey, thx for the quick response!
I’m sharing a screenshot of the output from the portal.

We’ve already checked the Slurmctld and Slurmdbd logs, but didn’t find anything relevant to the issue. The Slurmctld logs do show that the job was submitted successfully, and it also appears in the job list when running squeue.

OK I see it - you can tell by the card in yellow what’s wrong. I assume your cluster is using some form of submit_host to ssh into while issuing the sbatch command.

We rely heavily on parsing stderr and stdout for the job id and/or error output after issuing sbatch.

It appears that we did not anticipate this string being in standard out, “this string” being the ssh message about adding a host to the list of known hosts.

From what I can tell, you’re submitting the job correctly, but we’re not able to parse the job id because of this ssh output about adding that host to the list of known hosts. Seems like you need to silence ssh’s own output while keeping sbatch’s stdout & stderr intact.
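
Roughly something like the sketch below (minimal and untested, with submit-host and job.sh standing in for whatever your cluster actually uses):

# Option 1: pre-seed known_hosts once, so ssh never prints the
# "Warning: Permanently added ... to the list of known hosts." message.
ssh-keyscan submit-host >> ~/.ssh/known_hosts

# Option 2: lower ssh's own log level so only real ssh errors reach stderr,
# while sbatch's stdout/stderr (and thus the job id) pass through untouched.
ssh -o LogLevel=ERROR submit-host sbatch job.sh

A blanket 2>/dev/null is the wrong fix here: the remote sbatch’s stderr comes back over the same ssh channel, so redirecting it locally would throw away sbatch’s error output along with the ssh noise.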

Fixed the issue, the job ID is now appearing properly.


However, I found the following error in the ondemand-nginx/ram500/error.log:

ERROR "slurm_load_jobs error: Unable to contact slurm controller (connect failure)"

Yes that looks much better - what’s in the parentheses () should be the job id.

Not sure why you wouldn’t be able to contact the slurm controller, but you can issue squeue -j 191 manually on the same machine (or ssh somesubmithost squeue -j 191) to replicate.

It’s important that you issue a similar command on the exact same machine OOD does, so you can replicate and debug your slurm controller error.

Also, there may be something in the slurm controller’s own logs that indicates what the issue could be.
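
Concretely, something like this (the user, job id, and submit host names are the ones from this thread; the controller log path is a common default, not a given):

# On the OOD host, as the same user the job was submitted as:
sudo -u ram500 squeue -j 191

# Or through the submit host, if your cluster config sets one:
ssh somesubmithost squeue -j 191

# On the machine running slurmctld, watch its log while you retry
# (path varies by install):
sudo tail -f /var/log/slurm/slurmctld.log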

Hey,
Thanks for pointing in the right direction!

Here are our observations so far:

  • The Slurm controller appears to be running both on the OOD instance and the Head Node, and both seem to be functioning independently.
  • When we submit a job on the OOD instance using sbatch --wrap="hostname", the job doesn’t appear on the Login, Head, or Compute nodes — and the same happens in reverse.
  • We’ve also verified connectivity from the OOD instance to the Head Node on port 6819, where the Slurm controller is currently running (commands sketched right after this list).
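
Roughly the checks we ran for that last point (head-node standing in for our actual Head Node hostname):

# Raw TCP reachability to the slurmctld port:
nc -zv head-node 6819

# Ask the controller configured in slurm.conf whether it responds:
scontrol ping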

PS: We resolved the “Unable to contact Slurm controller (connect failure)” issue by removing the additional Slurm instance running on the OOD server and updating the Slurm configuration to point to the Head Node.
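
For anyone hitting the same thing, the resulting slurm.conf on the OOD instance looks roughly like this (hostnames are stand-ins for our own):

# /etc/slurm/slurm.conf on the OOD instance: point at the Head Node's
# controller instead of a local one, after stopping the local daemon
# (systemctl disable --now slurmctld).
ClusterName=ood-pcluster
SlurmctldHost=head-node
SlurmctldPort=6819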

You want to think of the OOD machine as a login node. Essentially you just want sbatch and squeue to work as they would through the CLI. It doesn’t need to be a controller node, just one that can submit jobs through sbatch and query jobs through squeue.
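
A quick sanity check from the OOD machine, in that spirit (the printed job id will differ):

# Submission should print "Submitted batch job <id>" ...
sbatch --wrap="hostname"

# ... and the job should show up in the cluster's queue:
squeue -u $USER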

Yeah, we figured this out: the squeue command on the OOD instance wasn’t showing jobs from the ParallelCluster, just as you mentioned earlier. It was pointing to the Slurm controller running locally on the OOD instance instead of the one on the Head Node.

Thanks a lot for the support, really appreciate it!
