OOD on a single workstation

Hello there,

We have a deep learning workstation at our department with 4 GPUs, and lately we have had more users than GPUs, so I thought that introducing OOD with Slurm as the queue system would solve our troubles.

The software we use is mainly Jupyter, DeepLabCut, and interactive desktops.

I understand that this is not the intended use of OOD, so I am just hoping to understand things a little better to tweak them.
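
For context, the cluster definition OOD uses is just this one machine; it looks roughly like this (file name, title, and bin path are illustrative):

# /etc/ood/config/clusters.d/workstation.yml -- single-node Slurm cluster;
# the file name, title, and bin path here are illustrative
v2:
  metadata:
    title: "Workstation"
  login:
    host: "localhost"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
  batch_connect:
    basic:
      script_wrapper: "%s"
    vnc:
      script_wrapper: "%s"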

For the most part everything works fine: I am able to submit jobs and create Jupyter and desktop sessions, but what I am having trouble with is connecting to them.

For the interactive desktop, noVNC does not point at the correct websocket port. I can check the connection.yml file and enter the correct port manually, so it is not the end of the world, but it is not ideal.

In a similar fashion, Jupyter points to the wrong location. I can again go to the connection.yml, check the port and password, and enter them manually, but I really want to remove this friction for my users, who are not exactly tech savvy.
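
For reference, the connection.yml each session writes under its output directory looks something like this (password redacted):

# ~/ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/<session-id>/connection.yml
host: navu
port: 21857
password: <generated-password>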

I think I need to understand better how OOD handles things, but web technology is not my forte.

If anyone has insight into troubleshooting this or experience with this unconventional deployment I would appreciate the help.

Best,

Hi and welcome!

I’d say let’s solve this one first as it’s easier. Can you share the view.html.erb for this Jupyter application?

Hi there,

Thank you for taking the time to troubleshoot this with me, I really appreciate it.

Jupyter is a fine place to start. I took a stab at view.html.erb, but it became apparent to me that I did not understand how things were being handled.

<form action="/node/<%= host %>/<%= port %>/login" method="post" target="_blank">
  <input type="hidden" name="password" value="<%= password %>">
  <button class="btn btn-primary" type="submit">
    <i class="fa fa-eye"></i> Connect to Jupyter
  </button>
</form>

This seems to match the logic in ./template/before.sh.erb:

c.NotebookApp.base_url = '/node/${host}/${port}/'
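
The connection variables themselves are set earlier in that script; paraphrasing from memory, it is roughly the following (find_port and create_passwd seem to be helper functions OOD provides to batch connect scripts):

# Rough sketch of the top of ./template/before.sh.erb -- not my exact file.
# find_port and create_passwd are helpers OOD sources into the script.

# Pick a free port on this host for the notebook server
port=$(find_port "${host}")

# Generate the plain-text password that view.html.erb posts to /login
password=$(create_passwd 16)

# Export so the Jupyter config heredoc and connection.yml pick them up
export port password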

From the output of one Jupyter session I got this:

[I 2024-11-08 10:18:01.466 ServerApp] Jupyter Server 2.14.2 is running at:
[I 2024-11-08 10:18:01.466 ServerApp] http://localhost:21857/node/navu/21857/lab
[I 2024-11-08 10:18:01.466 ServerApp]     http://127.0.0.1:21857/node/navu/21857/lab

So I tried simply prepending the port to the address in the view.html.erb code, without success.

Yeah, this all should just work. That view.html.erb looks good.

If you right-click the button and ‘Inspect’ the HTML, do you see the correct password in the form’s HTML?

Yep, I have looked at the page source and as far as I can tell the password is correct.

Though I find it odd that the error I get is a 404 on port 80.

So I am thinking that maybe I messed up something in the ood_portal.yml config, which is likely since I am a complete noob at web stuff.

ood_portal.yml (13.5 KB)

I reckon there are nuances to running both OOD and jobs on the same machine that I do not quite grasp. At times, I wonder if it would not have been better to deploy OOD from an LXC container.

OK that’s easy, you haven’t enabled the reverse proxy yet. Follow these instructions and you should be good to go.

https://osc.github.io/ood-documentation/latest/how-tos/app-development/interactive/setup/enable-reverse-proxy.html
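
For a single machine, the relevant settings look roughly like this; host_regex just has to match the hostname your jobs run on (navu, judging by your logs). After editing, regenerate the Apache config with sudo /opt/ood/ood-portal-generator/sbin/update_ood_portal and restart httpd.

# /etc/ood/config/ood_portal.yml -- enable the reverse proxy
host_regex: 'navu'   # must match the host(s) your jobs run on
node_uri: '/node'    # prefix-aware proxy, used by apps like Jupyter
rnode_uri: '/rnode'  # relative proxy, used by noVNC/websockify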

I doubt it. You’d have the same trouble you’re having now, but you’d also potentially have container issues on top of OOD setup issues.

Thanks a lot, that seems to have done the trick. Both Jupyter and noVNC seem to be working as expected.

I now just need to troubleshoot the issue that VNC sessions are not being killed/cleaned up when the Delete button is pressed. I have looked around some other posts, and it seems that this is related to lock files in /tmp.
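
(If anyone lands here later: the files in question appear to be stale X display locks, something like the below, though I have not verified this myself yet.)

# Illustrative only -- stale locks a dead Xvnc server can leave behind
ls -l /tmp/.X*-lock /tmp/.X11-unix/X*
# remove only entries whose owning process is gone (check first!)
rm -f /tmp/.X1-lock /tmp/.X11-unix/X1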

But that is a topic for another thread.

Thanks again for the help

What type of scheduler do you run? If it’s Slurm, you need to set ProctrackType to proctrack/cgroup because some of these VNC server processes have a parent PID of 1.

https://slurm.schedmd.com/slurm.conf.html#OPT_ProctrackType
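
That is, in slurm.conf (it needs a cgroup.conf present, even a default one, and a restart of slurmctld and slurmd afterwards):

# /etc/slurm/slurm.conf
ProctrackType=proctrack/cgroup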

Hi there,

Sorry for the delay in responding.

It seems that setting ProctrackType to proctrack/cgroup has indeed solved the issue.

Thanks a lot.