VirtualGL on Shared Node with Multiple GPUs

First, thanks to @mcuma for raising the question in this post, and to everyone else for providing additional information.

We were able to test direct GPU access through the EGL backend using @Micket’s method. However, for one of the applications we tested, the older versions didn’t work properly: only the latest version rendered smooth images, while all older versions showed corrupted images. The VirtualGL 3.0 user guide specifically mentions that “As of this writing, the EGL back end does not yet support all of the GLX extensions and esoteric OpenGL features that the GLX back end supports.” That may be why the older versions of the software failed.

We then tested the GLX backend. With this approach, all versions of the software worked fine. However, we ran into a problem when two jobs shared the same node with multiple GPUs. The X servers were started on the right GPU devices, but whichever vglrun started later would always cause the first vglrun to terminate with the following error:

[VGL] ERROR: OpenGL error 0x0502
[VGL] ERROR: in readPixels--
[VGL]    435: Could not read pixels

Any ideas are appreciated.

Ping

PS: I want to mention that, instead of changing TaskProlog in Slurm, I wrote a wrapper script for vglrun. It sets up the correct environment variables to use either the EGL backend or the GLX backend.
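For anyone curious, here is a minimal sketch of what such a wrapper can look like. The function name, the GPU-to-display mapping (display :N for GPU N), and the DRI device paths are my assumptions for illustration, not the actual script:

```shell
#!/bin/sh
# Hypothetical sketch of a vglrun wrapper: given a backend and the GPU index
# assigned to the job, compute the VGL_DISPLAY value VirtualGL should use.
# The GPU-to-display mapping and the /dev/dri paths are assumptions.
pick_vgl_display() {
    backend="$1"   # "egl" or "glx"
    gpu="$2"       # GPU index assigned to this job
    if [ "$backend" = "egl" ]; then
        # EGL backend: render directly on the GPU's DRI device node.
        echo "/dev/dri/card${gpu}"
    else
        # GLX backend: render on the X server started for that GPU.
        echo ":${gpu}"
    fi
}

# Example: a job assigned GPU 1, using the GLX backend.
VGL_DISPLAY="$(pick_vgl_display glx 1)"
export VGL_DISPLAY
echo "VGL_DISPLAY=$VGL_DISPLAY"
# A real wrapper would then hand off to VirtualGL:  exec vglrun "$@"
```

In a Slurm job the GPU index would typically come from something like CUDA_VISIBLE_DEVICES rather than being hard-coded.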

I’m not sure, but I did find this, which may offer some debugging tips.

These may be naive questions that you’ve already sorted through - but they popped into my mind and may help you debug.

  • Sounds like you know it’s an issue that only arises when you use multiple GPUs. Is that right?
  • Are you sure you’re attaching to the right $DISPLAY in each session? Lots of chatter about $DISPLAY on the GitHub issue.

Hi Ping,

what’s the application that does not run right with EGL?

We are setting up EGL with our new Rocky 8 setup and it’s working well, except for one application - IDV - which uses Java 3D and we suspect it’s hitting the EGL’s incomplete GLX implementation.

Thanks, Jeff. I’ll take a look at the discussion at the link you posted and see if anything is useful.

To answer your question:

  1. Yes, it only pops up when multiple GPUs are used, and each GPU runs a different X server.
  2. I am sure I attached the right $DISPLAY, since the second vglrun always works while it causes the first running vglrun to crash. In other words, the first vglrun works fine until the second vglrun starts running.

Hi Martin,

The application having problems with the EGL backend is synopsis (version 5.4 and older). Version 5.5 works fine.