OpenGL Hardware Acceleration Fails on Non-Primary GPUs in Interactive Desktop Sessions

Hello,

I’m encountering an issue with OpenGL hardware acceleration in interactive desktop sessions started from Open OnDemand. The problem occurs when a session is assigned to a GPU other than GPU 0. Below is the detailed setup and the behavior I’m experiencing:

System Setup:

  • Cluster Configuration:
    • 10 NVIDIA RTX 5000 Ada Generation GPUs per node (headless setup)
    • Nodes are accessed via SSH (headnode → node3)
  • NVIDIA Driver Version: 555.42.06
  • Open OnDemand Version: 3.1.10
  • Environment Variables:
    • __GLX_VENDOR_LIBRARY_NAME=nvidia
    • __NV_PRIME_RENDER_OFFLOAD=1
  • xorg.conf: generated with nvidia-xconfig --enable-all-gpus --use-display-device=none --virtual=1920x1080 --allow-empty-initial-configuration

Problem Description:

  • Behavior:
    • When a session is assigned GPU 0 (e.g., DISPLAY=:1.0), OpenGL applications like vmd run with full hardware acceleration, and the GPU is utilized as expected.
    • When a session is assigned any other GPU (e.g., GPU 1 through GPU 9, DISPLAY=:2.0, :3.0, etc.), OpenGL applications fail with the following error:
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  150 (GLX)
  Minor opcode of failed request:  24 (X_GLXCreateNewContext)
  Value in failed request:  0x0
  Serial number of failed request:  50
  Current serial number in output stream:  51
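
A rough way to reproduce this without vmd, since any GLX client goes through the same context creation (glxinfo here comes from the glx-utils package):

export __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1

# Display backed by GPU 0: the NVIDIA renderer string comes back as expected
DISPLAY=:1.0 glxinfo | grep "OpenGL renderer"

# Display backed by any other GPU: fails with the BadValue / X_GLXCreateNewContext error above
DISPLAY=:2.0 glxinfo | grep "OpenGL renderer"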

I’ve also tried installing VirtualGL and following the instructions in this post, but I get the same result.

Any insights or guidance would be greatly appreciated. Please let me know if additional details or logs are needed.

I’ve also tried installing VirtualGL

I’m confused, since as far as I know this is the only way you’d get hardware acceleration in a setup like this. I have no idea what you tried before that, unless you’ve hacked in some way of dispatching jobs that launches a primary X server with a fake screen for each job.

The nice part of VirtualGL is the EGL backend: there is literally no setup of fake headless X servers. Start as many VNC servers as you want, and pick whichever GPU you wish to use (which can be different for different applications within the same desktop session) by specifying the device:

VGL_DISPLAY=/dev/dri/card3 vglrun vmd
# or
vglrun -d /dev/dri/card3 vmd

In this scenario, there is literally no “primary” GPU. They are all equal.
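
For example, something along these lines (a sketch assuming TurboVNC’s vncserver in its default /opt/TurboVNC location and the official VirtualGL packages):

# One VNC desktop; no per-GPU X server configuration is needed
/opt/TurboVNC/bin/vncserver :5

# Inside that desktop, different applications can target different GPUs
DISPLAY=:5 vglrun -d /dev/dri/card1 vmd
DISPLAY=:5 vglrun -d /dev/dri/card3 glxgears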

I went to another node that I had not made any changes on. I installed VirtualGL 3.1-3 from EPEL. Then I ran the following:

$ nvidia-smi --query-gpu=gpu_bus_id --format=csv,noheader
00000000:45:00.0
$ vglrun -d /dev/dri/card1
card1   card10  
$ vglrun -d /dev/dri/card1 +v glxgears
[VGL] NOTICE: Added /usr/lib64/VirtualGL to LD_LIBRARY_PATH
[VGL] Shared memory segment ID for vglconfig: 6553619
[VGL] VirtualGL v3.1 64-bit (Build 20240622)
[VGL] ERROR: Could not load EGL functions
[VGL]    /usr/lib64/VirtualGL/libvglfaker.so: undefined symbol: eglGetProcAddress
[VGL] Shared memory segment ID for vglconfig: 6553621

Please let me know if you have any suggestions.

Didn’t even know they started (attempting) to package VirtualGL in EPEL 9. They never did previously.

Quickly trying it out, it looks like a poor packaging job.

ldd /usr/lib64/VirtualGL/libvglfaker.so

shows no linking to EGL (or to GLX, for that matter).
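
For example, filtering for the libraries you would expect it to pull in (just an illustration of the check; with the EPEL 3.1-3 build it prints nothing):

ldd /usr/lib64/VirtualGL/libvglfaker.so | grep -iE 'libegl|libglx'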

It can be worked around by explicitly forcing it to use libEGL via:

LD_PRELOAD=/lib64/libEGL.so.1 vglrun -d /dev/dri/card3 glxgears

I have no idea why whoever packaged it for EPEL did this.

The official packages from Releases · VirtualGL/virtualgl · GitHub don’t suffer from this and work fine for me in Rocky Linux 9.
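
Swapping is just a matter of grabbing the x86_64 RPM from that releases page and installing it, roughly like this (package and file names depend on the release; removing the EPEL build first avoids a conflict):

sudo dnf remove VirtualGL
sudo dnf install ./VirtualGL-*.x86_64.rpm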

Switching to the GitHub RPM works perfectly! I even have it working in my singularity containers.
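
For the containers, the invocation ends up looking roughly like this (VirtualGL is installed inside the image as well; the image name is just a placeholder):

# --nv pulls in the NVIDIA userspace libraries, and /dev (including /dev/dri) is
# available inside the container by default, so vglrun can open the DRI device directly
singularity exec --nv vmd.sif vglrun -d /dev/dri/card3 vmd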

Your TaskProlog snippet saved me a lot of time as well.
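
For anyone else wiring this up: the general idea is a Slurm TaskProlog that maps the GPU allocated to the job to its DRI device and exports VGL_DISPLAY. A hypothetical sketch of that kind of prolog (not the actual snippet):

#!/bin/bash
# Hypothetical TaskProlog sketch, not the snippet referred to above.
# Slurm adds any "export NAME=value" lines printed here to the task environment.
# Assumes CUDA_VISIBLE_DEVICES has already been set for the task by the gres plugin.
gpu_id=${CUDA_VISIBLE_DEVICES%%,*}
bus_id=$(nvidia-smi -i "$gpu_id" --query-gpu=gpu_bus_id --format=csv,noheader)
pci=$(echo "${bus_id,,}" | sed 's/^0000//')     # 00000000:45:00.0 -> 0000:45:00.0
card=$(basename /sys/bus/pci/devices/"$pci"/drm/card*)
echo "export VGL_DISPLAY=/dev/dri/${card}"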

Thank you!