I’m trying to get VirtualGL configured so we can run apps with GPU acceleration. Our GPU nodes run GPU Direct Storage, which may cause issues with the VirtualGL config. Has anyone deployed VirtualGL on nodes that also run GPU Direct Storage?
Is GPU Direct Storage preventing the use of EGLDevice buffers? Apart from some basic device permissions, that’s about all the EGL back end in VGL really requires.
I’m unfamiliar with GPU Direct Storage. Can you give more details on what errors you see?
GPU Direct Storage is an Nvidia technology that allows us to load data directly into GPU memory from a very high-speed storage appliance over InfiniBand. When I try to configure VirtualGL as per its documentation, these are the errors I get.
If I select the EGL back end only:
Restrict framebuffer device access to vglusers group (recommended)?
[Y/n]
n
... Modifying /etc/security/console.perms to disable automatic permissions
for DRI devices ...
... Creating /etc/modprobe.d/virtualgl.conf to set requested permissions for
/dev/nvidia* ...
... Attempting to remove nvidia module from memory so device permissions
will be reloaded ...
modprobe: FATAL: Module nvidia is in use.
... Granting write permission to /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia4 /dev/nvidia5 /dev/nvidia6 /dev/nvidia7 /dev/nvidia-caps /dev/nvidiactl /dev/nvidia-fs0 /dev/nvidia-fs1 /dev/nvidia-fs10 /dev/nvidia-fs11 /dev/nvidia-fs12 /dev/nvidia-fs13 /dev/nvidia-fs14 /dev/nvidia-fs15 /dev/nvidia-fs2 /dev/nvidia-fs3 /dev/nvidia-fs4 /dev/nvidia-fs5 /dev/nvidia-fs6 /dev/nvidia-fs7 /dev/nvidia-fs8 /dev/nvidia-fs9 /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools for all users ...
... Granting write permission to /dev/dri/card0 for all users ...
If I select the GLX + EGL back ends:
Disable XTEST extension (recommended)?
[Y/n]
Y
... Modifying /etc/security/console.perms to disable automatic permissions
for DRI devices ...
... Creating /etc/modprobe.d/virtualgl.conf to set requested permissions for
/dev/nvidia* ...
... Attempting to remove nvidia module from memory so device permissions
will be reloaded ...
modprobe: FATAL: Module nvidia is in use.
... Granting write permission to /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3 /dev/nvidia4 /dev/nvidia5 /dev/nvidia6 /dev/nvidia7 /dev/nvidia-caps /dev/nvidiactl /dev/nvidia-fs0 /dev/nvidia-fs1 /dev/nvidia-fs10 /dev/nvidia-fs11 /dev/nvidia-fs12 /dev/nvidia-fs13 /dev/nvidia-fs14 /dev/nvidia-fs15 /dev/nvidia-fs2 /dev/nvidia-fs3 /dev/nvidia-fs4 /dev/nvidia-fs5 /dev/nvidia-fs6 /dev/nvidia-fs7 /dev/nvidia-fs8 /dev/nvidia-fs9 /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools for all users ...
... Granting write permission to /dev/dri/card0 for all users ...
... /etc/gdm/Init/Default has been saved as /etc/gdm/Init/Default.orig.vgl ...
... Adding xhost +LOCAL: to /etc/gdm/Init/Default script ...
... Creating /usr/share/gdm/greeter/autostart/virtualgl.desktop ...
... /etc/gdm/custom.conf has been saved as /etc/gdm/custom.conf.orig.vgl ...
... Disabling Wayland in /etc/gdm/custom.conf ...
... Disabling XTEST extension in /etc/gdm/custom.conf ...
... Setting default run level to 5 (enabling graphical login prompt) ...
... Commenting out DisallowTCP line (if it exists) in /etc/gdm/custom.conf ...
In both cases, VirtualGL appears to be trying to unload the Nvidia kernel modules. The modules can’t be unloaded because GPU Direct Storage is running.
VirtualGL tries to unload the Nvidia kernel modules when I run the vglserver_config command.
You can’t unload the nvidia modules unless you stop whatever services are keeping them in use. But worst case, just restart the node afterwards: it will pick up the new permissions from virtualgl.conf on the next boot. It’s not like you need to re-run the config; it’s a one-time thing.
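For reference, the nvidia device permissions end up in that /etc/modprobe.d/virtualgl.conf file as nvidia module options. On a node where you answered “n” to the vglusers restriction, it typically boils down to a single line along these lines (illustrative; the UID/GID/mode depend on the answers you gave):

options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0666

After the reboot you can confirm it took effect with ls -l /dev/nvidia* /dev/dri/card0 and check that the modes match what you asked for.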
I can’t unload the Nvidia drivers. I get the following when I try:
rmmod: ERROR: Module nvidia is in use
If I add the -f flag to force it, I get this:
rmmod: ERROR: could not remove 'nvidia': Resource temporarily unavailable
rmmod: ERROR: could not remove module nvidia: Resource temporarily unavailable
Am I missing something really obvious?
Am I missing something really obvious?
I think so, yes. You could spend time looking for whatever process is still holding the nvidia module active and temporarily stop it while you run vglserver_config. This is a one-time thing, done during the setup of the node.
Or you can just restart the machine after running vglserver_config. It looks to me like it should just work afterwards anyway, since rebooting will effectively reload all your modules (on boot), thus picking up all the correct permissions.
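If you do want to track down what is pinning the module rather than reboot, these generic commands (nothing VirtualGL-specific) usually show it:

lsmod | grep nvidia          # the "Used by" column lists dependent kernel modules, e.g. nvidia_fs from GPU Direct Storage
sudo lsof /dev/nvidia*       # user-space processes holding the device nodes open
sudo fuser -v /dev/nvidia*   # same information in a different format

Note that a dependent kernel module such as nvidia_fs keeps a reference without showing up in lsof, which would explain “Module nvidia is in use” even with no user processes running.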
I don’t use vglserver_config myself for the clusters I deploy it on; it’s not doing anything that magical.
For example, on my compute nodes, where I only use the EGL back end, I literally just install VirtualGL and set up /etc/udev/rules.d/99-virtualgl-dri.rules:
KERNEL=="renderD*", MODE="0666", OWNER="root", GROUP="root"
and reboot the node. Permissions for the /dev/dri/card* devices are managed by SLURM for each job, so I don’t need to touch those.
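If a reboot isn’t convenient, the rule can usually be applied to a live node as well; this is plain udev, nothing VirtualGL-specific:

sudo udevadm control --reload-rules           # re-read the rules from /etc/udev/rules.d
sudo udevadm trigger --subsystem-match=drm    # re-apply them to the existing card*/renderD* devices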
For my login nodes I use
KERNEL=="card*|renderD*", MODE="0666", OWNER="root", GROUP="root"
plus a custom xorg.conf, that xhost +LOCAL: thing for the display manager, and enabling the graphical target as the default runlevel. I don’t use vglserver_config here either, because I prefer to set things up with Ansible.
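For completeness, the non-udev pieces of that login-node setup amount to roughly the following (a sketch of what my Ansible role does; the gdm script path is simply the one vglserver_config also edits, other display managers differ):

# boot into the graphical target (the "run level 5" step from vglserver_config)
sudo systemctl set-default graphical.target

# allow local, non-X-authenticated users to attach to the display manager's X server,
# which the GLX back end needs; with gdm this line goes in /etc/gdm/Init/Default
xhost +LOCAL: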
I don’t use vglserver_config myself for the clusters I deploy it on; it’s not doing anything that magical. For example, on my compute nodes, where I only use the EGL back end, I literally just install VirtualGL and set up
/etc/udev/rules.d/99-virtualgl-dri.rules
You just install it, create that file and it works? The problem I’m trying to solve is with Metashape. It crashes when it gets to a step where it tries to use a GPU. The number of GPUs the user asks for shows up correctly. I will go and add that file to see if it resolves my issue.
You just install it, create that file and it works?
If you also restart the machine so that the new rules are picked up, have SLURM set up with cgroups to grant permissions to the allocated GPUs (which is the usual configuration for compute nodes), and have VGL_DISPLAY assigned to each job’s allocation, then yes. The login node uses the GLX back end by default, so more setup is required there, unless you only have one GPU (then you could always just select that device with VGL_DISPLAY).
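To make that concrete, a compute-node job ends up looking roughly like this sketch; the renderD128 path is only an example of what you would derive from the GPU that SLURM actually allocated:

#!/bin/bash
#SBATCH --gres=gpu:1                       # SLURM + cgroups restrict the job to its allocated GPU

# EGL back end: point VirtualGL at the DRI render node of the allocated GPU
export VGL_DISPLAY=/dev/dri/renderD128     # example device node; map it from the allocation in practice

# run the OpenGL application through VirtualGL inside the X proxy session (e.g. TurboVNC)
vglrun glxinfo | grep "OpenGL renderer"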
The problem I’m trying to solve is with Metashape. It crashes when it gets to a step where it tries to use a GPU. The number of GPUs the user asks for shows up correctly.
The number of GPUs? In the software? That doesn’t sound like it has anything at all to do with VGL.
Nothing described here would immediately make me think VGL is to blame.
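One quick way to separate the two issues is to confirm that VirtualGL itself is delivering GPU rendering before digging into the Metashape crash; glxinfo and the glxspheres demo that ships with VirtualGL are the usual smoke tests:

vglrun glxinfo | grep "OpenGL renderer"    # should report the NVIDIA GPU, not llvmpipe
vglrun /opt/VirtualGL/bin/glxspheres64     # simple rendering demo bundled with VirtualGL

If those run on the GPU, the VGL path is fine and the crash is more likely on the Metashape or driver side.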