I’m trying to get VirtualGL configured so we can run apps with GPU acceleration. Our GPU nodes run GPU Direct Storage, which may cause issues with the VirtualGL config. Has anyone deployed VirtualGL on nodes that also run GPU Direct Storage?
Is GPU Direct Storage preventing the use of EGLDevice buffers? Apart from some basic device permissions, that’s about all the EGL back end in VGL really requires.
I’m unfamiliar with GPU Direct Storage. Can you give more details on what errors you see?
GPU Direct Storage is an Nvidia technology that allows us to load data directly into GPU memory from a very high-speed storage appliance over InfiniBand. When I try to configure VirtualGL as per its documentation, these are the errors I get.
If I select the EGL back end only:
Restrict framebuffer device access to vglusers group (recommended)?
[Y/n]
n
... Modifying /etc/security/console.perms to disable automatic permissions
for DRI devices ...
... Creating /etc/modprobe.d/virtualgl.conf to set requested permissions for
/dev/nvidia* ...
... Attempting to remove nvidia module from memory so device permissions
will be reloaded ...
modprobe: FATAL: Module nvidia is in use.
... Granting write permission to /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia4 /dev/nvidia5 /dev/nvidia6 /dev/nvidia7 /dev/nvidia-caps /dev/nvidiactl /dev/nvidia-fs0 /dev/nvidia-fs1 /dev/nvidia-fs10 /dev/nvidia-fs11 /dev/nvidia-fs12 /dev/nvidia-fs13 /dev/nvidia-fs14 /dev/nvidia-fs15 /dev/nvidia-fs2 /dev/nvidia-fs3 /dev/nvidia-fs4 /dev/nvidia-fs5 /dev/nvidia-fs6 /dev/nvidia-fs7 /dev/nvidia-fs8 /dev/nvidia-fs9 /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools for all users ...
... Granting write permission to /dev/dri/card0 for all users ...
If I select the GLX + EGL back ends:
Disable XTEST extension (recommended)?
[Y/n]
Y
... Modifying /etc/security/console.perms to disable automatic permissions
for DRI devices ...
... Creating /etc/modprobe.d/virtualgl.conf to set requested permissions for
/dev/nvidia* ...
... Attempting to remove nvidia module from memory so device permissions
will be reloaded ...
modprobe: FATAL: Module nvidia is in use.
... Granting write permission to /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia4 /dev/nvidia5 /dev/nvidia6 /dev/nvidia7 /dev/nvidia-caps /dev/nvidiactl /dev/nvidia-fs0 /dev/nvidia-fs1 /dev/nvidia-fs10 /dev/nvidia-fs11 /dev/nvidia-fs12 /dev/nvidia-fs13 /dev/nvidia-fs14 /dev/nvidia-fs15 /dev/nvidia-fs2 /dev/nvidia-fs3 /dev/nvidia-fs4 /dev/nvidia-fs5 /dev/nvidia-fs6 /dev/nvidia-fs7 /dev/nvidia-fs8 /dev/nvidia-fs9 /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools for all users ...
... Granting write permission to /dev/dri/card0 for all users ...
... /etc/gdm/Init/Default has been saved as /etc/gdm/Init/Default.orig.vgl ...
... Adding xhost +LOCAL: to /etc/gdm/Init/Default script ...
... Creating /usr/share/gdm/greeter/autostart/virtualgl.desktop ...
... /etc/gdm/custom.conf has been saved as /etc/gdm/custom.conf.orig.vgl ...
... Disabling Wayland in /etc/gdm/custom.conf ...
... Disabling XTEST extension in /etc/gdm/custom.conf ...
... Setting default run level to 5 (enabling graphical login prompt) ...
... Commenting out DisallowTCP line (if it exists) in /etc/gdm/custom.conf ...
In both cases, VirtualGL appears to be trying to unload the Nvidia kernel modules. The modules can’t be unloaded because GPU Direct Storage is running.
VirtualGL tries to unload the Nvidia kernel modules when I run the vglserver_config command.
You can’t unload the nvidia modules unless you stop whatever services are keeping them in use. But worst case, just restart the node afterwards: it will pick up the new permissions from virtualgl.conf on the next boot. It’s not like you need to re-run the config; it’s a one-time thing.
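For reference, the nvidia device permissions end up in that /etc/modprobe.d/virtualgl.conf file as nvidia module options. On a node where you answered “n” to the vglusers restriction, it typically boils down to a single line along these lines (illustrative; the UID/GID/mode depend on the answers you gave):

options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0666

After the reboot you can confirm it took effect with ls -l /dev/nvidia* /dev/dri/card0 and check that the modes match what you asked for.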
I can’t unload the Nvidia drivers. I get the following when I try:
rmmod: ERROR: Module nvidia is in use
If I add the -f flag to force it, I get this:
rmmod: ERROR: could not remove 'nvidia': Resource temporarily unavailable
rmmod: ERROR: could not remove module nvidia: Resource temporarily unavailable
Am I missing something really obvious?
Am I missing something really obvious?
I think so, yes. You could spend time looking for whatever process is still holding the nvidia module active and temporarily stop it while you run vglserver_config. This is a one-time thing, done during the setup of the node.
Or you can just restart the machine after running vglserver_config. It looks to me like it should just work afterwards anyway, since rebooting will effectively reload all your modules (on boot), thus picking up all the correct permissions.
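If you do want to track down what is pinning the module rather than reboot, these generic commands (nothing VirtualGL-specific) usually show it:

lsmod | grep nvidia          # the "Used by" column lists dependent kernel modules, e.g. nvidia_fs from GPU Direct Storage
sudo lsof /dev/nvidia*       # user-space processes holding the device nodes open
sudo fuser -v /dev/nvidia*   # same information in a different format

Note that a dependent kernel module such as nvidia_fs keeps a reference without showing up in lsof, which would explain “Module nvidia is in use” even with no user processes running.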
I don’t use vglserver_config myself for the clusters I deploy it on; it’s not doing anything that magical.
For example, on my compute nodes, where I only use the EGL back end, I literally just install VirtualGL and set up /etc/udev/rules.d/99-virtualgl-dri.rules:
KERNEL=="renderD*", MODE="0666", OWNER="root", GROUP="root"
and reboot the node. Permissions for the /dev/dri/card* devices are managed by SLURM for each job, so I don’t need to touch those.
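If a reboot isn’t convenient, the rule can usually be applied to a live node as well; this is plain udev, nothing VirtualGL-specific:

sudo udevadm control --reload-rules           # re-read the rules from /etc/udev/rules.d
sudo udevadm trigger --subsystem-match=drm    # re-apply them to the existing card*/renderD* devices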
For my login nodes I use
KERNEL=="card*|renderD*", MODE="0666", OWNER="root", GROUP="root"
plus a custom xorg.conf, that xhost +LOCAL: thing for the display manager, and enabling the graphical target as the default runlevel. I don’t use vglserver_config here either, because I prefer to set things up with Ansible.
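For completeness, the non-udev pieces of that login-node setup amount to roughly the following (a sketch of what my Ansible role does; the gdm script path is simply the one vglserver_config also edits, other display managers differ):

# boot into the graphical target (the "run level 5" step from vglserver_config)
sudo systemctl set-default graphical.target

# allow local, non-X-authenticated users to attach to the display manager's X server,
# which the GLX back end needs; with gdm this line goes in /etc/gdm/Init/Default
xhost +LOCAL: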
I don’t use vglserver_config myself for the clusters I deploy it on; it’s not doing anything that magical. For example, on my compute nodes, where I only use the EGL back end, I literally just install VirtualGL and set up
/etc/udev/rules.d/99-virtualgl-dri.rules
You just install it, create that file and it works? The problem I’m trying to solve is with Metashape. It crashes when it gets to a step where it tries to use a GPU. The number of GPUs the user asks for shows up correctly. I will go and add that file to see if it resolves my issue.
You just install it, create that file and it works?
If you also restart the machine so that the new rules are picked up, have SLURM set up with cgroups to grant permissions to the allocated GPUs (which is the usual configuration for compute nodes), and have VGL_DISPLAY assigned to each job’s allocation, then yes. The login node uses the GLX back end by default, so more setup is required there, unless you only have one GPU (then you could always just select that device with VGL_DISPLAY).
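To make that concrete, a compute-node job ends up looking roughly like this sketch; the renderD128 path is only an example of what you would derive from the GPU that SLURM actually allocated:

#!/bin/bash
#SBATCH --gres=gpu:1                       # SLURM + cgroups restrict the job to its allocated GPU

# EGL back end: point VirtualGL at the DRI render node of the allocated GPU
export VGL_DISPLAY=/dev/dri/renderD128     # example device node; map it from the allocation in practice

# run the OpenGL application through VirtualGL inside the X proxy session (e.g. TurboVNC)
vglrun glxinfo | grep "OpenGL renderer"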
The problem I’m trying to solve is with Metashape. It crashes when it gets to a step where it tries to use a GPU. The number of GPUs the user asks for shows up correctly.
The number of GPUs? In the software? That doesn’t sound like it has anything at all to do with VGL.
Nothing described here would immediately make me think VGL is to blame.
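One quick way to separate the two issues is to confirm that VirtualGL itself is delivering GPU rendering before digging into the Metashape crash; glxinfo and the glxspheres demo that ships with VirtualGL are the usual smoke tests:

vglrun glxinfo | grep "OpenGL renderer"    # should report the NVIDIA GPU, not llvmpipe
vglrun /opt/VirtualGL/bin/glxspheres64     # simple rendering demo bundled with VirtualGL

If those run on the GPU, the VGL path is fine and the crash is more likely on the Metashape or driver side.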