Problems using vglrun on the interactive desktop

The environment:
CentOS-7.6 x64
VMware VGA and RTX-2080-Ti
ondemand-2.0.28
VirtualGL-3.0.1
xfce desktop
Desktops: OSC/bc_desktop on GitHub ([MOVED] Batch Connect - Desktop)

  *-display:0
       description: VGA compatible controller
       product: GD 5446
       vendor: Cirrus Logic
       physical id: 2
       bus info: pci@0000:00:02.0
       version: 00
       width: 32 bits
       clock: 33MHz
       capabilities: vga_controller rom
       configuration: driver=cirrus latency=0
       resources: irq:0 memory:f0000000-f1ffffff memory:fe0d0000-fe0d0fff memory:fe0c0000-fe0cffff
  *-display:1
       description: VGA compatible controller
       product: TU102 [GeForce RTX 2080 Ti Rev. A]
       vendor: NVIDIA Corporation
       physical id: 3
       bus info: pci@0000:00:03.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:29 memory:fd000000-fdffffff memory:e0000000-efffffff memory:f2000000-f3ffffff ioport:c000(size=128) memory:fe000000-fe07ffff

Using the interactive desktop, I get the following error:

[user01@node004 Desktop]$ glxgears 
6465 frames in 5.0 seconds = 1292.977 FPS
4637 frames in 5.0 seconds = 927.256 FPS
X connection to :1.0 broken (explicit kill or server shutdown).
[user01@node004 Desktop]$ vglrun glxgears
Invalid MIT-MAGIC-COOKIE-1 key[VGL] ERROR: Could not open display :0.

When connected directly with vncviewer, it works fine.

Hi, I have to say right at the start that these are very hard issues for us to debug, because they’re so specific to your environment.

The first question I’d ask is which scheduler you use, and how you allocate the GPU and start that X server.

Here’s what we do in a Slurm prologue:

# Only act once per node (task 0) and only for jobs requesting the "vis" GRES
if [[ "$SLURM_LOCALID" == "0" && "$SLURM_JOB_GRES" == *"vis"* ]]; then
  if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
    # Take the first allocated GPU index and start an X server on that display
    FIRSTGPU=$(echo $CUDA_VISIBLE_DEVICES | tr ',' "\n" | head -1)
    setsid /usr/bin/X :${FIRSTGPU} -noreset >& /dev/null &
    sleep 2
    # Emit "export ..." lines so the task environment picks up the new DISPLAY
    if [ -n "$DISPLAY" ]; then
      echo "export OLDDISPLAY=$DISPLAY"
    fi
    echo "export DISPLAY=:$FIRSTGPU"
  fi
fi
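For what it’s worth, the echoed export lines only take effect when this runs as a Slurm TaskProlog, whose stdout lines of the form export NAME=value are injected into the task’s environment. The GPU-selection part itself is just first-field extraction; here is a stand-alone sketch of that logic (the sample CUDA_VISIBLE_DEVICES value is an assumption for illustration):

```shell
#!/bin/bash
# Stand-alone illustration of the FIRSTGPU extraction above.
# Sample allocation (assumption): GPUs 2 and 3 were assigned to the job.
CUDA_VISIBLE_DEVICES="2,3"

# Split the comma-separated list on newlines and keep the first index.
FIRSTGPU=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | head -1)

echo "X server would start on display :$FIRSTGPU"
```

With this allocation the prologue would start the X server as :2 and export DISPLAY=:2 into the job.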

We initially had some weird vglrun issues as well, but none of them were related to OnDemand. One of the problems we encountered was solved by specifying the display with “-d”: vglrun -d :x.x glxgears, where :x.x is the display and screen number, for example vglrun -d :0.0 glxgears. Sometimes the default display doesn’t work, especially if you have multiple GPUs in the system.
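If you’d rather not guess which screen to pass to -d, a small loop can probe each one. This is only a sketch: it assumes at most four GPUs/screens on display :0 and that glxinfo is installed; adjust the range for your hardware.

```shell
#!/bin/bash
# Sketch: probe screens :0.0 through :0.3 to find one VirtualGL can render on.
# Assumes at most four GPUs on display :0; adjust the range as needed.
found=""
for s in 0 1 2 3; do
  if command -v vglrun >/dev/null 2>&1 \
     && vglrun -d ":0.$s" glxinfo >/dev/null 2>&1; then
    found=":0.$s"
    break
  fi
done
echo "${found:-no working 3D screen found}"
```

Whichever screen it reports is the one to use with vglrun -d for real applications.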

Also, to help us troubleshoot this, could you provide some additional information? The output of:

nvidia-smi --query-gpu=gpu_bus_id --format=csv,noheader

and also the full output of nvidia-smi.
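The bus IDs matter because each X screen is tied to a specific GPU through a BusID line in xorg.conf, and xorg.conf wants the value in decimal while nvidia-smi prints it in hex. As a rough sketch of the conversion (the sample bus ID below is taken from the lshw output earlier in the thread, not from your nvidia-smi):

```shell
#!/bin/bash
# Sketch: convert an nvidia-smi bus ID (domain:bus:device.function, hex)
# into the decimal "PCI:bus:device:function" form that xorg.conf expects.
busid="00000000:00:03.0"   # sample value; normally read from nvidia-smi

bus=$((16#$(echo "$busid" | cut -d: -f2)))               # hex bus -> decimal
dev=$((16#$(echo "$busid" | cut -d: -f3 | cut -d. -f1))) # hex device -> decimal
fn=$((16#$(echo "$busid" | cut -d. -f2)))                # function number

xorg_busid="PCI:$bus:$dev:$fn"
echo "BusID \"$xorg_busid\""
```

If the BusID in the generated xorg.conf doesn’t match the GPU you were allocated, the X server ends up on the wrong device, which produces exactly this kind of “could not open display” failure.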

This is a user preference, but I also prefer glxspheres64, since it prints FPS and makes the speedup easy to see.

$ /opt/VirtualGL/bin/glxspheres64 
Polygons in scene: 62464 (61 spheres * 1024 polys/spheres)
GLX FB config ID of window: 0x163 (8/8/8/0)
Visual ID of window: 0x3f6
Context is Direct
OpenGL Renderer: llvmpipe (LLVM 12.0.1, 256 bits)
44.023927 frames/sec - 38.649486 Mpixels/sec
43.564125 frames/sec - 38.245817 Mpixels/sec

With basic vglrun:

$ vglrun /opt/VirtualGL/bin/glxspheres64 
Polygons in scene: 62464 (61 spheres * 1024 polys/spheres)
GLX FB config ID of window: 0xdd (8/8/8/0)
Visual ID of window: 0x21
Segmentation fault (core dumped)

but with the display set:

$ vglrun -d :0.2 /opt/VirtualGL/bin/glxspheres64 
Polygons in scene: 62464 (61 spheres * 1024 polys/spheres)
GLX FB config ID of window: 0x4cf (8/8/8/0)
Visual ID of window: 0x21
Context is Direct
OpenGL Renderer: Quadro M6000 24GB/PCIe/SSE2
724.313754 frames/sec - 635.889531 Mpixels/sec
746.916392 frames/sec - 655.732839 Mpixels/sec

In this example I had to use the third screen, :0.2, because this system has four GPUs and I was allocated the third one. So you may have to test each screen if you’re on a multi-GPU system.

Hi, I’m using the Slurm scheduler.
I hadn’t added a gres parameter to the submit/slurm.yml.erb file in the bc_desktop directory before; I’ll try adding it.

Is the following snippet meant to go into a desktops/xfce.sh script?

if [[ "$SLURM_LOCALID" == "0" && "$SLURM_JOB_GRES" == *"vis"* ]]; then
  if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
    FIRSTGPU=$(echo $CUDA_VISIBLE_DEVICES | tr ',' "\n" | head -1)
    setsid /usr/bin/X :${FIRSTGPU} -noreset >& /dev/null &
    sleep 2
    if [ -n "$DISPLAY" ]; then
      echo "export OLDDISPLAY=$DISPLAY"
    fi
    echo "export DISPLAY=:$FIRSTGPU"
  fi
fi

My virtual machine has a virtual graphics card and an RTX graphics card.

What I’ve shared is what we use as a Slurm prologue. Note that it only acts on jobs requesting the vis GRES type, which is our GRES for visualization.

https://slurm.schedmd.com/prolog_epilog.html

I added the gres parameter to bc_desktop, but running vglrun still fails:

[user01@node002 Desktop]$ vglrun glxgears 
No protocol specified
[VGL] ERROR: Could not open display :0.

[root@node002 output]# scontrol show job 49
JobId=49 JobName=sys/dashboard/sys/bc_desktop/linux
   UserId=user01(1001) GroupId=group01(1001) MCS_label=N/A
   Priority=4294901754 Nice=0 Account=user01 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:23 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-10-14T17:02:20 EligibleTime=2022-10-14T17:02:20
   AccrueTime=2022-10-14T17:02:20
   StartTime=2022-10-14T17:02:20 EndTime=2022-10-14T18:02:20 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-10-14T17:02:20
   Partition=vis AllocNode:Sid=node001:537
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node002
   BatchHost=node002
   NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,node=1,billing=2,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/gfs/home/user01/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/linux/output/263a9175-59fe-4031-be67-4e652d67ef37
   StdErr=/gfs/home/user01/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/linux/output/263a9175-59fe-4031-be67-4e652d67ef37/output.log
   StdIn=/dev/null
   StdOut=/gfs/home/user01/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/linux/output/263a9175-59fe-4031-be67-4e652d67ef37/output.log
   Power=
   TresPerNode=gpu:1,gpu