First, thanks to @mcuma for raising the question in this post, and to everyone else for providing additional information.
We were able to test direct GPU access through the EGL backend using @Micket’s method. However, for one of the applications we tested, only the latest version rendered smooth images; all older versions showed corrupted images. The VirtualGL 3.0 user guide specifically mentions that “As of this writing, the EGL back end does not yet support all of the GLX extensions and esoteric OpenGL features that the GLX back end supports.” That may be why the older versions of the software failed.
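For reference, selecting the EGL backend comes down to pointing VirtualGL at a DRI device instead of a 3D X server. This is a minimal sketch, not our exact setup; the device path `/dev/dri/card0` is an assumption and should be replaced with the card Slurm actually assigned to the job:

```shell
# EGL backend: VGL_DISPLAY names a DRI device node rather than an X display.
# /dev/dri/card0 is an assumed path; use the GPU assigned to your job.
export VGL_DISPLAY=/dev/dri/card0

if command -v vglrun >/dev/null 2>&1; then
    # +v makes VirtualGL print which back end it chose; glxinfo -B is a
    # quick sanity check of the renderer string.
    vglrun +v glxinfo -B
else
    echo "vglrun not found; would run: vglrun +v glxinfo -B"
fi
```

The same thing can be done per invocation with `vglrun -d /dev/dri/card0 <app>` instead of exporting `VGL_DISPLAY`.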
We then tested the GLX backend. With this approach, all versions of the software worked fine. However, we ran into a problem when two jobs shared the same node with multiple GPUs. Each X server was started on the correct GPU device, but whenever a second vglrun started later, the first vglrun terminated with the following error:
[VGL] ERROR: OpenGL error 0x0502
[VGL] ERROR: in readPixels--
[VGL] 435: Could not read pixels
Any ideas are appreciated.
PS: I want to mention that instead of changing TaskProlog in Slurm, I wrote a wrapper script for vglrun that sets up the environment variables needed to use either the EGL backend or the GLX backend.
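In case it helps others, a wrapper of that kind can be sketched roughly as follows. This is a hypothetical reconstruction, not my actual script: the `VGL_BACKEND` variable name, the defaulting to GPU 0, and the assumption of one X server per GPU on displays `:0`, `:1`, … are all mine:

```shell
#!/bin/bash
# Hypothetical vglrun wrapper: pick the VirtualGL backend per job.
#   VGL_BACKEND=egl -> EGL backend (direct DRI device, no 3D X server)
#   VGL_BACKEND=glx -> GLX backend (one 3D X server per GPU, assumed)

# First GPU index Slurm assigned to this job; default to 0 outside Slurm.
gpu="${CUDA_VISIBLE_DEVICES%%,*}"
gpu="${gpu:-0}"

if [ "${VGL_BACKEND:-glx}" = "egl" ]; then
    # EGL backend: VGL_DISPLAY names the DRI device node.
    export VGL_DISPLAY="/dev/dri/card${gpu}"
else
    # GLX backend: VGL_DISPLAY names the 3D X server's display, assuming
    # the X server for GPU N was started on display :N.
    export VGL_DISPLAY=":${gpu}"
fi

if command -v vglrun >/dev/null 2>&1; then
    exec vglrun "$@"
else
    # vglrun not on PATH (e.g. when dry-testing the wrapper): show the setup.
    echo "VGL_DISPLAY=${VGL_DISPLAY}"
fi
```

Mapping the X display or DRI device off the job's first assigned GPU is what keeps two jobs on the same node from stepping on each other's VGL_DISPLAY.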
Thanks Jeff. I’ll take a look at the discussion at the link you posted and see if anything is useful.
To answer your question:
Yes, it only pops up when multiple GPUs are in use and each GPU runs a different X server.
I am sure I attached the right $DISPLAY, since the second vglrun always works while it crashes the first one. In other words, the first vglrun works fine until the second vglrun starts running.