We have a deep learning workstation at our department with 4 GPUs, and lately we have had more users than GPUs, so I thought that introducing OOD with Slurm as a queue system would solve our troubles.
The software we use is mainly Jupyter, DeepLabCut and interactive desktops.
I understand that this is not the intended use of OOD, so I am just hoping to understand things a little better to tweak them.
For the most part everything works fine: I am able to submit jobs and create Jupyter and desktop sessions, but what I am having trouble with is connecting to them.
For the interactive desktop, noVNC is not configured with the correct websocket port. I can check the connection.yml file and enter the correct port manually, so it is not the end of the world, but it is not ideal.
In a similar fashion, Jupyter points to the wrong location. I can again go to connection.yml, check the port and password, and enter them manually, but I really want to remove this friction for my users, who are not exactly tech savvy.
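For anyone following along, the manual lookup described above could be scripted roughly like this. It is only a sketch: it assumes connection.yml is a flat list of `key: value` lines, and the field names `host`, `port` and `password` in the sample are assumptions about what the file contains, not something confirmed in this thread. It is not a full YAML parser.

```python
def parse_connection_yml(text):
    """Parse flat 'key: value' lines into a dict (not a full YAML parser)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and the YAML document marker.
        if not line or line.startswith("#") or line == "---":
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Hypothetical sample content; real files may carry different keys.
sample = """\
host: node01
port: 8888
password: abc123
"""

conn = parse_connection_yml(sample)
print(conn["host"], conn["port"])
```

All values come back as strings, so the port would need an `int()` conversion before being used as a number.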
I think I must understand better how OOD handles things, but web technology is not my forte.
If anyone has insight into troubleshooting this or experience with this unconventional deployment I would appreciate the help.
Thank you for taking the time to troubleshoot this with me, I really appreciate it.
Jupyter is a fine place to start. I took a stab at view.html.erb, but it became apparent to me that I did not understand how things were being handled.
I reckon there are nuances from running both OOD and jobs on the same machine that I do not quite grasp. At times, I wonder if it would not have been better to have deployed OOD from an LXC container.
Thanks a lot, that seems to have done the trick.
Both Jupyter and noVNC seem to be working as expected.
I now just need to troubleshoot the issue that VNC sessions are not being killed/cleaned up when the delete button is pressed. I have looked around some other posts and it seems that this is related to lock files in /tmp.
What type of scheduler do you run? If it’s Slurm, you need to set ProctrackType to proctrack/cgroup, because some of these VNC server processes have a parent PID of 1.
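A quick way to see what is currently configured is to look for the `ProctrackType` line in slurm.conf. The helper below is a minimal sketch that scans the file's text for that setting; the function name is made up for illustration, and the location of slurm.conf on your system is something you would need to supply yourself.

```python
def proctrack_type(conf_text):
    """Return the ProctrackType value from slurm.conf text, or None if unset."""
    for line in conf_text.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        # Compare the parameter name without regard to case.
        if sep and key.strip().lower() == "proctracktype":
            return value.strip()
    return None


# Hypothetical excerpt of a slurm.conf file.
sample_conf = """\
# Process tracking
ProctrackType=proctrack/cgroup
"""

print(proctrack_type(sample_conf))
```

If this returns `None` or something other than `proctrack/cgroup`, that would be the line to add or change before restarting the Slurm daemons.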