Thank you. The Jupyter app is just the OSC bc_example_jupyter. The config I’m using for it is shown here. I’ve deployed this several times before on different clusters and have not run into a similar problem.
The web GUI is now showing:
We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.
although I’d be very surprised if anyone else is using it (the monitoring is proxied through this server too, so unfortunately I can’t check that way!).
[root@ondemand cloud-user]# grep -r "Request queue full" /var/log/ondemand-nginx/*/
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 00:07:47.4147 18086/Tk age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 7-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:15:03.0711 18086/Tm age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 8-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:18:29.8312 18086/To age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 9-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
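To see how often the queue fills up rather than just that it does, the grep output above can be summarised with a small script. This is only a sketch: the regular expression is inferred from the three log lines quoted above, and the helper names (queue_full_events, events_per_hour) are made up for illustration, so it may need adjusting for other Passenger versions.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official Passenger log format definition).
QUEUE_FULL = re.compile(
    r"\[ W (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*?"
    r"Request queue full \(configured max\. size: (?P<size>\d+)\)"
)


def queue_full_events(lines):
    """Return (timestamp, configured_max_size) for each 'Request queue full' line."""
    events = []
    for line in lines:
        match = QUEUE_FULL.search(line)
        if match:
            events.append((match.group("ts"), int(match.group("size"))))
    return events


def events_per_hour(events):
    """Count events per 'YYYY-MM-DD HH' bucket to show when the queue filled up."""
    return Counter(ts[:13] for ts, _ in events)
```

Feeding it the lines of a per-user error.log (for example via open(path) on a file under /var/log/ondemand-nginx/) gives a per-hour count, which makes it easier to line the 503s up with when a notebook was launched.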
The OOD dashboard recovered, but trying to launch another Jupyter notebook has made it hang again. However, looking at the ondemand/data/sys/dashboard/batch_connect/sys/jupyter/slurm/output/07ec87c8-0d11-4057-8a23-ba430afedfee/output.log file, the job appears to have launched fine, and squeue shows the job running. The CPU looks almost idle, and there is plenty of free memory and no swap in use.
Re. the path /etc/ood/config/apps/jupyter/slurm.yml: that does exist. I’d assumed the slurm part came from the cluster name, which is “slurm”; i.e. in my Open OnDemand Ansible configuration, openondemand_clusters is the ood-ansible clusters role variable. That bit of the configuration is definitely the same as in other deployments.
You can search /var/log/ondemand-nginx/$USER/error.log for execve to see the actual command we issue. It’s not a bare squeue call; it includes formatting options.
That’s the only thing that comes to mind. We issue that command to build the page; most other things should be in memory. I’d take a look at all the commands you’re having to issue to build that page.
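If it helps, the execve search can be scripted so the hits come back with line numbers and can be narrowed to a particular command. This is a rough sketch: the exact layout of those log lines isn't shown in this thread, so it only does plain substring matching, and the function names are made up for illustration.

```python
def find_execve_lines(lines, needle="execve"):
    """Return (line_number, line) for every log line that mentions the needle."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]


def mentions_command(hits, command="squeue"):
    """Keep only the hits that also mention the given command name."""
    return [(number, line) for number, line in hits if command in line]
```

Running find_execve_lines over the lines of /var/log/ondemand-nginx/$USER/error.log and then mentions_command on the result should show how many scheduler calls are made around each page load, and whether something other than squeue is being run.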