I would appreciate any recommendations or suggestions for a data-center GPU card that works well with Open OnDemand for interactive use of HPC apps and desktop sessions. We are currently using Nvidia L40S cards for computational purposes and they work well, but they are too expensive for our viz purposes. What are others in the Open OnDemand community using? What has your experience been with shared app/VDI sessions on GPU cards?
I have also been looking into this, and I have concluded that it makes sense to set up visualization boxes with SR-IOV capable GPUs, so I would probably look at AMD or Intel GPUs. The idea would be to have sliceable GPUs so they could be used for courses and whatnot.
My understanding is that in order to unlock rendering on an Nvidia vGPU slice, you must purchase an Nvidia RTX Virtual Workstation (vWS) subscription license, whereas these features are freely available on AMD and Intel GPUs.
The one caveat is that these visualization boxes need to run relatively recent kernels.
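If it helps anyone evaluating hardware, here is a rough sketch of how one might check a Linux box for SR-IOV capable display devices and see which kernel it is running. It reads the standard `sriov_totalvfs` and `class` attributes under `/sys/bus/pci/devices`. The function name `sriov_devices` and its parameters are made up for illustration. Treat it as a starting point under those assumptions, not a definitive check: a nonzero value only says the device advertises SR-IOV, not that the driver stack can actually slice it for rendering.

```python
from pathlib import Path
import platform


def sriov_devices(sysfs_root="/sys/bus/pci/devices", display_only=True):
    """Return {pci_address: total_vfs} for PCI devices advertising SR-IOV.

    sysfs_root and display_only are illustrative parameters (hypothetical
    helper, not part of Open OnDemand or any vendor tool).
    """
    found = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            # sriov_totalvfs only exists on SR-IOV capable physical functions
            total = int((dev / "sriov_totalvfs").read_text().strip())
        except (OSError, ValueError):
            continue
        if total <= 0:
            continue
        if display_only:
            try:
                pci_class = (dev / "class").read_text().strip().lower()
            except OSError:
                continue
            # PCI base class 0x03 is "display controller"
            if not pci_class.startswith("0x03"):
                continue
        found[dev.name] = total
    return found


if __name__ == "__main__":
    print("kernel:", platform.release())
    for addr, total in sriov_devices().items():
        print(addr, "advertises up to", total, "virtual functions")
```

Passing `display_only=False` also lists NICs and other SR-IOV devices, which can be handy for confirming that SR-IOV is enabled in the BIOS at all.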
At Montana State University we have a range of NVidia A40, A100, H100, B6000, and some new B300s on the way. The compute nodes have anywhere from 2 GPUs up to 4 of the bigger models. We use OOD Xfce remote desktops to reserve these GPU compute nodes. We do have vCenter vGPU machines using ESXi, but we haven’t tried allowing vGPU reservations yet, so I’m not 100% sure about the RTX Virtual Workstation licensing the previous poster mentioned. The VMs using these can reserve mismatched vGPU sizes, we have no Virtual Workstation licensing, and we can keep slicing a card up to the amount of vGPU RAM available. So far as I know we don’t need the Virtual Workstation licensing, but I could be wrong. We just don’t use the ESXi vGPU for that.
We recently incorporated a couple of large VDI systems, each with 48 CPU cores (2 sockets, 24 cores each) and NVidia T4 and A40 GPUs. We were given the systems by University IT because demand for desktop sessions wasn’t very high. We repurposed the VDI systems into our cluster to make better use of the hardware, as these systems were NOT cheap!
We also have a new product called CatChat, which has reserved a number of our A100/H100 GPUs for LLM and AI model checkouts using reservations on our Slurm-based cluster. I believe this is being extended to the VDI systems recently added to our cluster. CatChat has become extremely popular with our university, enough so that we’ve been given a sizeable budget to accommodate the number of colleges interested in utilizing these tools. Access to our cluster is not required, but the CatChat sessions run as Slurm jobs on the cluster.