Desktop not leveraging VirtualGL - NVIDIA A30 GPU on worker

Hi Folks,

I am stuck and cannot figure out how to solve this issue. I have the desktop running fine with the VNC server when I don’t use the GPU, but I have cases where some apps need the GPU to do some of the work on the back end to reduce front-end load.

I have installed VirtualGL to try to make this work, but it fails.

How have I set this up?

I ran the vglserver_config utility, which created the vglusers group, and then added users to that group.
I have tried all of the following in $USER/ondemand/dev/bc_desktop/template/desktops/xfce.sh:

/usr/bin/vglrun -d /dev/dri/card1 +v xfce4-session
and tried:
/usr/bin/vglrun +v xfce4-session
and tried:
/opt/TurboVNC/bin/vncserver -xstartup /usr/bin/vglrun startxfce4

output.log

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: gpu002.meerkat.mcri.edu.au:1 (chris.welsh)' started on display gpu002.meerkat.mcri.edu.au:1

Log file is vnc.log
Successfully started VNC server on gpu002.meerkat.mcri.edu.au:5901...
Script starting...
Starting websocket server...
Launching desktop 'xfce'...
[websockify]: pid: 103808 (proxying 42992 ==> localhost:5901)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
[VGL] Shared memory segment ID for vglconfig: 98361
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] Opening connection to 3D X server :0
[VGL] ERROR: Could not open display :0.
[VGL] Shared memory segment ID for vglconfig: 98362
Desktop 'xfce' ended with 1 status...
[websockify]: started successfully (proxying 42992 ==> localhost:5901)
Scanning VNC log file for user authentications...
Generating connection YAML file...
Cleaning up...
Killing Xvnc process ID 103784

Here’s the device setup under /dev/dri:

[root@gpu002 dri]# ls -l
total 0
drwxr-xr-x. 2 root root          100 Sep  8 20:31 by-path
crw-rw----. 1 root vglusers 226,   0 Sep  8 20:31 card0
crw-rw----. 1 root vglusers 226,   1 Sep  8 20:31 card1
crw-rw----. 1 root vglusers 226, 128 Sep  8 20:31 renderD128
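
A quick sanity check (nothing VirtualGL-specific, just plain id) to confirm the account actually picks up the group that owns those devices:

id chris.welsh        # vglusers should appear in the groups list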

Here is the nvidia-smi output:

[root@gpu002 dri]# nvidia-smi
Tue Sep  9 14:11:37 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:21:00.0 Off |                    0 |
| N/A   31C    P0             30W /  165W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

lsmod | grep nvidia

nvidia_drm            143360  0
nvidia_modeset       1413120  1 nvidia_drm
nvidia_uvm           3895296  0
nvidia              70705152  2 nvidia_uvm,nvidia_modeset

This GPU server works fine with normal Slurm jobs allocated to it, but I want to see whether I can leverage it for GPU-optimised OOD desktops.

Note: I also tried to run VirtualGL once the desktop was started (without starting the session itself under VirtualGL, i.e. just “xfce4-session”) and got the following:

Note from the screenshot above that, within the desktop, I cannot open a display.
Note also that I cannot run glxinfo; it cannot open a display.

Below you will see it complaining about “ERROR: in init3D” (EGL). It does not matter whether I specify /dev/dri/card0 or /dev/dri/card1:

(base) [chris.welsh@gpu002 ~]$ vglrun -d /dev/dri/card1 +v glxinfo
[VGL] Shared memory segment ID for vglconfig: 131114
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
name of display: :1.0
[VGL] Opening EGL device /dev/dri/card1
[VGL] ERROR: in init3D--
[VGL]    214: No EGL devices found
[VGL] Shared memory segment ID for vglconfig: 131115
(base) [chris.welsh@gpu002 ~]$ 


A final observation is that I can run “nvidia-smi” when I SSH directly into the server with my account, but cannot when using the OOD desktop:

(base) [chris.welsh@gpu002 ~]$ nvidia-smi
Failed to initialize NVML: Insufficient Permissions
(base) [chris.welsh@gpu002 ~]$ 

Any steps you can give me to work through this would be welcome. Oh, and SELinux is set to permissive.

Wasn’t it actually necessary to configure/enable EGL in some CLI tool like vglserver_config? I’m looking through my Ansible recipes to see if I find something related. I did this setup years ago and it has worked since then.

Did Slurm (or the relevant scheduler) allocate the device for you during these jobs?

I notice that when you run nvidia-smi as root you can see it, but when you run it as a regular user you get Insufficient Permissions. I don’t believe we use Unix groups to control access to those devices.

When I run nvidia-smi as a regular user without access to a GPU I get No devices found. Our devices have these permissions: crw-rw-rw- 1 root root.

Also, when you run nvidia-smi it only reports one device, but you have two of them there in /dev; could there be something to that?
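
It might also be worth comparing the NVIDIA character devices directly, something like:

ls -l /dev/nvidia*    # on our nodes these show up as crw-rw-rw- root root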

Thx so much! Looking forward to finding “that one little thing that I missed” :slight_smile:

Hi Jeff,

Your question:

Did Slurm (or the relevant scheduler) allocate the device for you during these jobs?

Would I need to specify GRES resources and GPUs in the job submission? I did not, so perhaps that is it. Does someone have a sample of what you would specify in the form.yml, submit.yml, etc.? Thx

I can run nvidia-smi on the host as a normal user when I log in (see below), but I cannot in the OOD desktop terminal window. I had a thought: perhaps my normal bash login environment is not carried across to this terminal. Is it supposed to be?

Example from a normal SSH login to the GPU server: (A) as root, (B) as a standard user. Both work.
Note that only one device is listed, so Jeff, I have no idea why card0 and card1 exist. :thinking:

[root@gpu002 ~]# nvidia-smi
Wed Sep 10 10:13:50 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:21:00.0 Off |                    0 |
| N/A   31C    P0             30W /  165W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[root@gpu002 ~]# su - chris.welsh
Last login: Tue Sep  9 16:25:58 AEST 2025
(base) [chris.welsh@gpu002 ~]$ nvidia-smi
Wed Sep 10 10:14:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:21:00.0 Off |                    0 |
| N/A   31C    P0             30W /  165W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Hi All, I have set the following up in the submit.yml:

---
batch_connect:
  template: vnc
  script:
    native:
      #- "--gres=gpu:a30:1" # I also tried this with no difference ****
      - "--nodes=1"                       # Number of nodes
      - "--gpus=1"
      - "--ntasks=1"                      # Number of tasks (typically 1 per node)
      - "--gpus-per-task=1"          # number of gpus per task

Here is the output log from “/usr/bin/vglrun -d /dev/dri/card1 +v xfce4-session”, which is executed from xfce.sh in the “/home/chris.welsh/ondemand/dev/bc_desktop/template/desktops” dir.

Note the permissions issue raised in the log below: I am running this as myself (a standard cluster user who, outside of OOD, has no issue getting at and running things on this GPU).

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: gpu002.meerkat.mcri.edu.au:1 (chris.welsh)' started on display gpu002.meerkat.mcri.edu.au:1

Log file is vnc.log
Successfully started VNC server on gpu002.meerkat.mcri.edu.au:5901...
Script starting...
Starting websocket server...
Launching desktop 'xfce'...
[websockify]: pid: 280476 (proxying 4773 ==> localhost:5901)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
grep: ./websockify.log: No such file or directory
[VGL] Shared memory segment ID for vglconfig: 327682
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
/usr/bin/iceauth:  creating new authority file /run/user/37738/ICEauthority
[VGL] Shared memory segment ID for vglconfig: 327683
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.

(xfwm4:280529): xfwm4-WARNING **: 11:23:01.732: GLX extension missing, GLX support disabled.
[VGL] Shared memory segment ID for vglconfig: 327687
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327688
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327689
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327690
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327698
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327699
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327700
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327701
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327703
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327704
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[websockify]: started successfully (proxying 4773 ==> localhost:5901)
Scanning VNC log file for user authentications...
Generating connection YAML file...
[VGL] Shared memory segment ID for vglconfig: 327705
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] Shared memory segment ID for vglconfig: 327706
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.

ERROR: The current user does not have permission for operation

[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327707
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.
[VGL] Shared memory segment ID for vglconfig: 327713
[VGL] VirtualGL v3.1.3 64-bit (Build 20250409)

** (wrapper-2.0:280577): WARNING **: 11:23:02.756: No outputs have backlight property
[VGL] NOTICE: Replacing dlopen("libGLX.so.1") with dlopen("libvglfaker.so")
[VGL] WARNING: The EGL back end requires a 2D X server with a GLX extension.

(wrapper-2.0:280576): libnotify-WARNING **: 11:23:02.858: Failed to connect to proxy

(wrapper-2.0:280577): Gtk-CRITICAL **: 11:23:02.869: gtk_icon_theme_has_icon: assertion 'icon_name != NULL' failed

(wrapper-2.0:280577): Gtk-CRITICAL **: 11:23:02.881: gtk_icon_theme_has_icon: assertion 'icon_name != NULL' failed

(wrapper-2.0:280577): Gtk-CRITICAL **: 11:23:02.881: gtk_icon_theme_has_icon: assertion 'icon_name != NULL' failed

(wrapper-2.0:280577): Gtk-CRITICAL **: 11:23:02.940: gtk_icon_theme_has_icon: assertion 'icon_name != NULL' failed

** (xfdesktop:280557): WARNING **: 11:23:06.325: Failed to register the newly set background with AccountsService '/usr/share/backgrounds/xfce/xfce-leaves.svg': GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: No such interface “org.freedesktop.DisplayManager.AccountsService”

(wrapper-2.0:280576): pulseaudio-plugin-WARNING **: 11:23:09.027: Disconnected from the PulseAudio server. Attempting to reconnect in 5 seconds...
Failed to create secure directory (/run/user/37738/pulse): No such file or directory

Here is what I get when I run the desktop.

sacct command:

sa582127       sys/dashboard/dev/bc_desktop                                 2025-09-10T11:22:54   00:17:46          GPU_SHORT      itec1          1             RUNNING      0:0 gpu002  
[root@login001 ~]# scontrol show partition GPU_SHORT
PartitionName=GPU_SHORT
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=00:40:00 DisableRootJobs=NO ExclusiveUser=NO ExclusiveTopo=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=02:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED MaxCPUsPerSocket=UNLIMITED
   Nodes=gpu002
   PriorityJobFactor=0 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=8192 MaxMemPerNode=UNLIMITED
   TRES=cpu=64,mem=515024M,node=1,billing=64,gres/gpu=1,gres/gpu:a30=1

/var/log/slurm/slurmd.log on gpu002

[2025-09-10T11:45:38.901] Launching batch job 582134 for UID 37738
[2025-09-10T11:45:38.918] [582134.batch] task/cgroup: _memcg_initialize: job: alloc=8192MB mem.limit=8192MB memsw.limit=8192MB job_swappiness=18446744073709551614
[2025-09-10T11:45:38.918] [582134.batch] task/cgroup: _memcg_initialize: step: alloc=8192MB mem.limit=8192MB memsw.limit=8192MB job_swappiness=18446744073709551614

As far as the VirtualGL component goes, it must not be using the GPU on this host.

From my SSH session (outside OOD) to the gpu002 server:

(base) [chris.welsh@gpu002 bc_desktop]$ nvidia-smi
Wed Sep 10 11:56:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:21:00.0 Off |                    0 |
| N/A   31C    P0             30W /  165W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

So I am scratching my head. Permissions? Anything I can try to dig deeper? Have I missed anything? Thx.

OK, to sum up the situation:

  • you’ve got a job in a GPU queue (it has a gpu in the TRES)
  • But you’re unable to interact with it from within the Slurm job
  • You are able to interact with it if you ssh to the node.

I guess I’d ask from within the job itself, do you have the right CUDA/SLURM environment variables set? I.e., CUDA_VISIBLE_DEVICES or similar.

@tdockendorf is there anything here you may spot that I can’t?

Did the job actually request GPUs? You have to do something like this from within the job:

scontrol show job=$SLURM_JOB_ID

A Slurm job can only see the GPUs if the job requested at least 1 GPU. Slurm uses cgroups to limit access to GPU devices so if a job doesn’t request GPUs, the GPUs are not visible but would be visible for SSH as it’s likely your SSH session is not within the job cgroup where GPUs are not accessible.
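
A rough way to see this in action (illustrative only; the exact partition and flags depend on your site):

# Without a GPU request, the device is hidden by the job's cgroup:
srun -p GPU_SHORT --pty nvidia-smi            # expect "No devices found" or a permissions error
# With a GPU request, it should appear:
srun -p GPU_SHORT --gpus=1 --pty nvidia-smi   # expect the A30 to be listed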

Yeah, you can see they request --gpus=1 and are put in that queue that has GPUs as a TRES.

Interestingly though the slurmd.log does not mention the GPU.

I don’t believe slurmd.log would mention the GPU usage.

From within the job do this:

scontrol show job=$SLURM_JOB_ID
echo $CUDA_VISIBLE_DEVICES

This will help show that the job was actually allocated a GPU. The CUDA_VISIBLE_DEVICES is also set by Slurm for GPU jobs so if that’s not set then the job likely wasn’t allocated a GPU.
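
Roughly what you would hope to see for a job that really was allocated a GPU (values illustrative):

scontrol show job=$SLURM_JOB_ID | grep -E 'ReqTRES|AllocTRES'
#   AllocTRES=cpu=1,mem=8G,node=1,billing=1,gres/gpu=1
echo $CUDA_VISIBLE_DEVICES
#   0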

I must admit I didn’t see anything in particular in my Ansible playbooks, but in my install notes I saw this command:

[root@gpu2 ~]# vglserver_config 

1) Configure server for use with VirtualGL (GLX + EGL back ends)
2) Unconfigure server for use with VirtualGL (GLX + EGL back ends)
3) Configure server for use with VirtualGL (EGL back end only)
4) Unconfigure server for use with VirtualGL (EGL back end only)
X) Exit

Choose:
3

Restrict framebuffer device access to vglusers group (recommended)?
[Y/n]
n
... Modifying /etc/security/console.perms to disable automatic permissions
    for DRI devices ...
... Creating /etc/modprobe.d/virtualgl.conf to set requested permissions for
    /dev/nvidia* ...
... Attempting to remove nvidia module from memory so device permissions
    will be reloaded ...
... Granting write permission to /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia-caps /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools for all users ...
... Granting write permission to /dev/dri/card0 /dev/dri/card1 /dev/dri/card2 /dev/dri/card3 /dev/dri/card4 for all users ...
... Granting write permission to /dev/dri/renderD128 /dev/dri/renderD129 /dev/dri/renderD130 /dev/dri/renderD131 for all users ...

1) Configure server for use with VirtualGL (GLX + EGL back ends)
2) Unconfigure server for use with VirtualGL (GLX + EGL back ends)
3) Configure server for use with VirtualGL (EGL back end only)
4) Unconfigure server for use with VirtualGL (EGL back end only)
X) Exit

Choose:
X
[root@gpu2 ~]#

I know this might be a silly question, but did you restart the machine after modifying the groups? Once a process is already running, depending on how group and user resolution is handled, the new group assignments may or may not be visible within a given process.

Indeed, as others mentioned, are the GRES devices managed by Slurm? If you create an interactive CLI Slurm session, can you run nvidia-smi as an unprivileged user?

Here is the output. Do you see any GPU devices reserved? I’m not seeing any, so hmm… even though I specify and submit with the GPU devices, they don’t show up here. I will check that the submit.yml has no format errors. Thoughts and suggestions welcome.

(base) [chris.welsh@gpu002 ~]$ scontrol show job=590957
JobId=590957 JobName=sys/dashboard/dev/bc_desktop
UserId=chris.welsh(37738) GroupId=chris.welsh.dg(41023) MCS_label=N/A
Priority=1 Nice=0 Account=itec1 QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:52 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2025-09-11T13:19:04 EligibleTime=2025-09-11T13:19:04
AccrueTime=2025-09-11T13:19:04
StartTime=2025-09-11T13:19:04 EndTime=2025-09-11T14:19:04 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-09-11T13:19:04 Scheduler=Main
Partition=GPU_SHORT AllocNode:Sid=172.16.12.16:1032179
ReqNodeList=(null) ExcNodeList=(null)
NodeList=gpu002
BatchHost=gpu002
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=8G,node=1,billing=1
AllocTRES=cpu=1,mem=8G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=8G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/c3a55a5f-35c4-45da-82a0-5f24c84aa6e3
StdErr=/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/c3a55a5f-35c4-45da-82a0-5f24c84aa6e3/output.log
StdIn=/dev/null
StdOut=/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/c3a55a5f-35c4-45da-82a0-5f24c84aa6e3/output.log

Hi Jeff,

I thought Slurm automatically figures this out, without me needing to assign GPUs to CUDA_VISIBLE_DEVICES myself?
I’m trying to identify what I have missed there. Any thoughts?
Quoting “Dr. Google”:

Slurm automatically manages the CUDA_VISIBLE_DEVICES environment variable for jobs requesting GPUs. When a job is allocated GPU resources, Slurm sets CUDA_VISIBLE_DEVICES to reflect the GPUs assigned to that specific job or task.

How Slurm Sets CUDA_VISIBLE_DEVICES:

  • GPU Request: When submitting a Slurm job, you request GPUs using options like --gres=gpu:N (for N GPUs per node) or --gpus-per-task=N (for N GPUs per task).

  • Allocation and Mapping: Slurm allocates specific physical GPUs on a node to your job or tasks.

  • Environment Variable Setting: Slurm then sets CUDA_VISIBLE_DEVICES within the job’s environment. The values in this variable are typically re-indexed from 0 within the context of the job, regardless of the physical IDs of the allocated GPUs. For example, if you request 2 GPUs, CUDA_VISIBLE_DEVICES will likely be set to 0,1 within your job, even if those correspond to physical GPUs 2 and 5 on the node.
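
As a quick sanity test of that claim (my own sketch, not from the quote):

srun -p GPU_SHORT --gpus=1 --pty bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'
# expecting something like:
#   CUDA_VISIBLE_DEVICES=0
#   GPU 0: NVIDIA A30 (UUID: GPU-...)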

Hi J,

I tried the following; nvidia-smi works from a standard Slurm interactive session on gpu002:

[root@login001 ~]# srun -p GPU_SHORT --job-name="InteractiveJob" --cpus-per-task=8 --mem-per-cpu=1500M --gres=gpu --gpus=1 --time=2:00:00 --pty bash
[root@gpu002 ~]# nvidia-smi
Thu Sep 11 14:56:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:21:00.0 Off |                    0 |
| N/A   31C    P0             30W /  165W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I updated the submit.yml and tried again, but it failed. Here are the file contents:

Desktop output:

[root@login001 ~]# scontrol show job 591046
JobId=591046 JobName=sys/dashboard/dev/bc_desktop
   UserId=chris.welsh(37738) GroupId=chris.welsh.dg(41023) MCS_label=N/A
   Priority=1 Nice=0 Account=itec1 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:09:10 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2025-09-11T14:55:38 EligibleTime=2025-09-11T14:55:38
   AccrueTime=2025-09-11T14:55:38
   StartTime=2025-09-11T14:55:38 EndTime=2025-09-11T15:55:38 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-09-11T14:55:38 Scheduler=Main
   Partition=GPU_SHORT AllocNode:Sid=172.16.12.16:1032179
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=gpu002
   BatchHost=gpu002
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=8G,node=1,billing=1
   AllocTRES=cpu=1,mem=8G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=8G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/316529a9-f9ca-4612-a058-942ae3a26252
   StdErr=/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/316529a9-f9ca-4612-a058-942ae3a26252/output.log
   StdIn=/dev/null
   StdOut=/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/316529a9-f9ca-4612-a058-942ae3a26252/output.log

Hmm…

Just to be sure, I restarted the GPU worker node and the OOD server, so at least for the GPU node we know that any VirtualGL group changes have definitely been applied. Unfortunately, still no joy.

The job has no GPU allocated.

It’s as if the --gpus=1 flag from the batch settings isn’t being passed. Can you find the batch job script, likely in a place like /home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/316529a9-f9ca-4612-a058-942ae3a26252, and share the contents?

Hi, I think you may be on to something here. Here is the job info from that dir:

/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/316529a9-f9ca-4612-a058-942ae3a26252



{
  "job_name": "sys/dashboard/dev/bc_desktop",
  "workdir": "/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/316529a9-f9ca-4612-a058-942ae3a26252",
  "output_path": "/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/316529a9-f9ca-4612-a058-942ae3a26252/output.log",
  "shell_path": "/bin/bash",
  "wall_time": 3600,
  "native": [
    "-N",
    1
  ],
  "queue_name": "GPU_SHORT",
  "email_on_started": false
}
job_script_options.json (END)


Perhaps I have something amiss here?

---
batch_connect:
  template: vnc
  script:
    native:
      - "--gres=gpu" # Requesting one GPU
      - "--nodes=1"                       # Number of nodes
      - "--gpus=1"
      - "--ntasks=1"                      # Number of tasks (typically 1 per node)
      - "--gpus-per-task=1"          # number of gpus per task

Or here?

---
cluster: "meerkat"
attributes:
  desktop:
    label: "Desktop Environment"
    widget: select
    options:
      - "gnome"
      - "kde"
      - "mate"
      - "xfce"
  bc_vnc_idle: 0
  bc_vnc_resolution:
    required: true
  node_type: null

form:
  - bc_vnc_idle
  - desktop
  - bc_account
  - bc_num_hours
  - bc_num_slots
  - node_type
  - bc_queue
  - bc_vnc_resolution
  - bc_email_on_started

Oh, and here is the log. Yep, nothing looks to be getting passed on.

App 12609 output: [2025-09-11 15:42:51 +1000 ]  INFO "execve = [{}, \"/usr/bin/sbatch\", \"-D\", \"/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/ebf79a5c-afbd-4f00-8a03-b22679
958923\", \"-J\", \"sys/dashboard/dev/bc_desktop\", \"-o\", \"/home/chris.welsh/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/ebf79a5c-afbd-4f00-8a03-b22679958923/output.log\", \"-p\", \"GPU_SH
ORT\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"-N\", \"1\", \"--parsable\", \"-M\", \"meerkat\"]"

And here is the directory listing with permissions of form, etc.

bc_desktop/:
total 6
-rw-r--r--. 1 chris.welsh chris.welsh.dg 2386 Jul 11 07:11 CHANGELOG.md
-rw-r--r--. 1 chris.welsh chris.welsh.dg  403 Aug 31 13:59 form.yml
-rw-r--r--. 1 chris.welsh chris.welsh.dg 1087 Jul 11 07:11 LICENSE.txt
-rw-r--r--. 1 chris.welsh chris.welsh.dg  312 Jul 11 07:11 manifest.yml
-rw-r--r--. 1 chris.welsh chris.welsh.dg  192 Jul 11 07:11 README.md
-rw-r--r--. 1 chris.welsh chris.welsh.dg  166 Sep 12 10:31 submit.yml.erb
drwxr-xr-x. 2 chris.welsh chris.welsh.dg 4096 Sep  9 11:16 template

bc_desktop/template:
total 2
-rwxr-xr-x. 1 chris.welsh chris.welsh.dg  100 Sep  9 10:58 before.sh.erb
drwxr-xr-x. 2 chris.welsh chris.welsh.dg 4096 Sep 11 13:36 desktops
-rwxr-xr-x. 1 chris.welsh chris.welsh.dg 1424 Sep  9 10:57 script.sh.erb

bc_desktop/template/desktops:
total 6
-rwxr-xr-x. 1 chris.welsh chris.welsh.dg 1450 Jul 11 07:11 gnome.sh
-rwxr-xr-x. 1 chris.welsh chris.welsh.dg    8 Jul 11 07:11 kde.sh
-rwxr-xr-x. 1 chris.welsh chris.welsh.dg 1391 Jul 11 07:11 mate.sh
-rwxr-xr-x. 1 chris.welsh chris.welsh.dg 1892 Sep 11 13:36 xfce.sh

At our site, we don’t use all these GPU settings. Maybe something is clashing with your requests, or maybe it’s just the way we have Slurm configured, but we don’t need to specify all of this; we simply use:
- "--gpus-per-node=x"

:man_facepalming: The script element should not be within the batch_connect element.

It should have this format, where batch_connect and script are both at the left-most indent:

batch_connect:
  # ...
script:
  # ..
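
With your options that would look something like this (an untested sketch; you may not need every GPU flag):

---
batch_connect:
  template: vnc
script:
  native:
    - "--nodes=1"
    - "--ntasks=1"
    - "--gpus=1"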

Thx Jeff,

I was excited to see what you found. I tried it, but it looks like I still have the issue. BTW, are there any logs that would report a nesting mistake like mine? Thx

Here is what I have now:

---
batch_connect:
  template: vnc
script:
  native:
    - "--gres=gpu"
    - "--nodes=1"
    - "--gpus=1"
    - "--ntasks=1"
    - "--gpus-per-task=1"
submit.yml.erb (END)
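
On the next launch I’ll also check whether the flags actually land in the generated job options (same path pattern as the earlier session, with a new session ID each time):

newest=$(ls -dt ~/ondemand/data/sys/dashboard/batch_connect/dev/bc_desktop/output/*/ | head -n 1)
cat "${newest}job_script_options.json"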

So it looks like I’m still stuck. Any other advice gratefully accepted. :slight_smile: