Seems like there's some issue with this jupyter/slurm app. Do you have a lot of ERB processing in this app? Can you share it? My guess is the app is doing something that's causing it to hang.
Thank you. The jupyter app is just the OSC bc_example_jupyter. The config I'm using for it is shown here. I've deployed this several times before on different clusters and haven't run into a similar problem.
The web GUI is now showing:
We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.
although I’d be very surprised if anyone else is using it (the monitoring is proxied through this server too, so unfortunately I can’t check that way!).
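One rough sanity check, assuming the per-user access logs sit alongside the error logs under /var/log/ondemand-nginx/ (that layout is the default, so adjust the path if this install differs), would be to count today's requests per user directly on the web node:

# Rough per-user request count for today, assuming one access.log per user
# under /var/log/ondemand-nginx/<user>/ (the stock OnDemand log layout).
today=$(date +%d/%b/%Y)   # nginx access-log date format, e.g. 13/Jul/2023
for log in /var/log/ondemand-nginx/*/access.log; do
  user=$(basename "$(dirname "$log")")
  printf '%-20s %s\n' "$user" "$(grep -c "$today" "$log" 2>/dev/null)"
done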
OK - Check your /var/log/ondemand-nginx/$USER/error.log - I’ll bet it says something like what’s in this ticket.
We never quite figured out why it's happening, but it would seem that some configuration/ERB/something is taking quite a long time to process.
The token in the URL suggests it's a sub-app, i.e. sys/jupyter/slurm - that extra slurm tacked on suggests it's trying to read the file /etc/ood/config/apps/jupyter/slurm.yml. Is that the case?
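If you want to check both of those quickly, something along these lines should do it. The paths are the defaults, and rendering the templates with a standalone erb is only a rough proxy (the dashboard renders them with extra helpers available, and you may need the Ruby that OnDemand itself ships if there's no erb on your PATH), but anything slow in the ERB should still stand out:

# Does the sub-app config that the extra URL token points at actually exist?
ls -l /etc/ood/config/apps/jupyter/slurm.yml

# Rough timing of every ERB template under /etc/ood/config. Templates that
# rely on dashboard-only helpers may error out here, but anything slow
# (e.g. ERB that shells out to a scheduler command) will still show up.
find /etc/ood/config -name '*.erb' | while read -r f; do
  echo "== $f"
  time erb "$f" > /dev/null
done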
[root@ondemand cloud-user]# grep -r "Request queue full" /var/log/ondemand-nginx/*/
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 00:07:47.4147 18086/Tk age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 7-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:15:03.0711 18086/Tm age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 8-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:18:29.8312 18086/To age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 9-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
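If I'm reading those lines right, Passenger is refusing to queue more than 100 requests for my PUN, which is presumably what surfaces as the "too many people are accessing this website" page. To see where that limit is configured (the PUN config path below is the default location nginx_stage generates, so it's an assumption about this install), I can grep the generated per-user nginx config:

# The per-user nginx (PUN) config is normally generated here by nginx_stage;
# adjust the path if this install puts it somewhere else.
grep -i passenger /var/lib/ondemand-nginx/config/puns/someuser.conf

# The queue only fills because requests aren't being served, so raising the
# limit would just treat the symptom - the real question is what the PUN is
# blocked on while requests back up.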
The OOD dashboard recovered, but trying to launch another Jupyter notebook has made it hang again. However, looking at the ondemand/data/sys/dashboard/batch_connect/sys/jupyter/slurm/output/07ec87c8-0d11-4057-8a23-ba430afedfee/output.log file, the job appears to have launched fine, and squeue shows the job running. CPU on the server looks almost idle, there's plenty of free memory, and no swap is in use.
Re. the path /etc/ood/config/apps/jupyter/slurm.yml: that file does exist. I'd assumed the slurm part came from the cluster name, which is "slurm"; in my Open OnDemand Ansible configuration, openondemand_clusters (the ood-ansible clusters role variable) is what defines it. That bit of the configuration is definitely the same as in my other deployments.
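To double-check which of the two the extra slurm token comes from, I can list both locations (paths are the documented defaults, so treat them as assumptions about this install):

# Cluster definitions - the cluster called "slurm" would normally be
# /etc/ood/config/clusters.d/slurm.yml
ls -l /etc/ood/config/clusters.d/

# App-level config for the jupyter batch connect app - a slurm.yml here is
# what would produce a sys/jupyter/slurm sub-app in the dashboard
ls -l /etc/ood/config/apps/jupyter/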
[root@ondemand cloud-user]# time squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5269 interacti ood-jupy someuser R 56:19 1 interactive-0
real 0m0.019s
You can search /var/log/ondemand-nginx/$USER/error.log for execve to see the actual command we issue. It's not simply squeue; it has formatting options added.
That's the only thing that comes to mind. We issue that command to build the page; most other things should already be in memory. I'd take a glance at what commands you're having to issue to build that page.
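For example, something like this pulls the recent commands out of the log so you can time the exact invocation yourself (same per-user error log as above):

# The dashboard logs each command it shells out to on lines containing
# "execve" in the per-user error log; show the most recent ones.
grep execve /var/log/ondemand-nginx/someuser/error.log | tail -n 20

# Then copy the full squeue invocation (with all of its format/filter flags)
# out of that output and run it under `time` as that user - if it takes
# seconds rather than milliseconds, that's where the page build is stalling.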