Debugging OOD hanging

Hi. I’ve got a Rocky Linux 8 server running OOD v2.0.29-1. The web UI hangs when I try to run interactive jobs, with the syslog showing things like:

[Wed Jul 12 22:47:15.408153 2023] [proxy:error] [pid 2711:tid 139867002492672] [client 10.29.107.91:60688] AH00898: Error reading from remote server returned by /pun/sys/dashboard/, referer: https://<FQDN>/pun/sys/dashboard/batch_connect/sys/jupyter/slurm/session_contexts
[Wed Jul 12 22:47:15.640963 2023] [lua:info] [pid 2711:tid 139867002492672] [client 10.29.107.91:60688] req_port="443" local_user="<USER>" req_is_https="true" log_hook="ood" req_user_ip="<IP>" req_referer="https://<FQDN>/pun/sys/dashboard/batch_connect/sys/jupyter/slurm/session_contexts" res_content_encoding="" req_accept_encoding="gzip, deflate, br" req_content_type="" req_accept_language="en-us,en;q=0.5" req_is_websocket="false" log_id="ZK6Ed1ez1e1jpTqKTMlT0AAAAAw" time_proxy="60293.316" req_hostname="<FQDN>" req_status="502" req_handler="proxy-server" log_time="2023-07-12T10:47:15.640883.0Z" res_location="" res_content_language="" req_accept="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" req_accept_charset="" time_user_map="0.001" req_user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" req_protocol="HTTP/1.1" res_content_length="" res_content_disp="" req_method="GET" res_content_type="" req_filename="proxy:http://localhost/pun/sys/dashboard/" res_content_location="" req_uri="/pun/sys/dashboard/" remote_user="<USER>" req_origin="" req_cache_control="" req_server_name="<FQDN>", referer: https://<FQDN>/pun/sys/dashboard/batch_connect/sys/jupyter/slurm/session_contexts

Any hints on debugging this, please?

It seems like there’s some issue with this jupyter/slurm app. Do you have a lot of ERB processing in it? Can you share it? My guess is the app is doing something that hangs.
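If it helps, a rough way to check for slow ERB is to look for anything in the app’s templates that shells out, and to time a render outside the dashboard. This is only a sketch - the /var/www/ood/apps/sys/jupyter path and the form.yml.erb/submit.yml.erb file names are assumptions about a standard deployment, and a standalone render won’t have the dashboard’s helpers available:

# Look for anything in the ERB templates that shells out - an external
# command run at render time is the usual cause of slow rendering.
cd /var/www/ood/apps/sys/jupyter
grep -nE '`|%x\(|system\(' form.yml.erb submit.yml.erb

# Crude standalone timing of the form template; a slow external command
# embedded in the ERB will still show up here.
time ruby -rerb -e 'puts ERB.new(File.read("form.yml.erb")).result'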

Thank you. The jupyter app is just the OSC bc_example_jupyter. The config I’m using for it is shown here. I’ve deployed this several times before on different clusters and haven’t run into a similar problem.

The web GUI is now showing:

We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.

although I’d be very surprised if anyone else is using it (the monitoring is proxied through this server too, so unfortunately I can’t check that way!).

OK - Check your /var/log/ondemand-nginx/$USER/error.log - I’ll bet it says something like what’s in this ticket.

We never quite figured out why it happens, but it would seem that some configuration/ERB/something is taking quite a long time to process.

The token in the URL suggests it’s a sub-app - i.e., sys/jupyter/slurm - and that extra slurm tacked on means it’s likely trying to read the file /etc/ood/config/apps/jupyter/slurm.yml. Does that file exist?

Hmm, you are right:

[root@ondemand cloud-user]# grep -r "Request queue full" /var/log/ondemand-nginx/*/
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 00:07:47.4147 18086/Tk age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 7-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:15:03.0711 18086/Tm age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 8-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:18:29.8312 18086/To age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 9-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)

The OOD dashboard recovered, but trying to launch another Jupyter notebook made it hang again. However, looking at the ondemand/data/sys/dashboard/batch_connect/sys/jupyter/slurm/output/07ec87c8-0d11-4057-8a23-ba430afedfee/output.log file, the job appears to have launched fine, and squeue shows the job running. The CPU looks almost idle, there’s plenty of free memory, and no swap in use.

Re. the path /etc/ood/config/apps/jupyter/slurm.yml: that does exist. I’d assumed the slurm part came from the cluster name, which is “slurm”; in my Open OnDemand Ansible configuration, openondemand_clusters (the ood-ansible clusters role variable) defines it. That bit of the configuration is definitely the same as in other deployments.
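As a sanity check (these are just the paths already mentioned in this thread, plus the standard clusters.d location), it’s worth keeping the two files straight - the cluster definition that ood-ansible should have rendered from openondemand_clusters versus the per-cluster app config the sub-app reads:

# Cluster definition: the basename of this file is the cluster name ("slurm").
ls -l /etc/ood/config/clusters.d/slurm.yml
# Per-cluster app config that the sys/jupyter/slurm sub-app appears to be reading.
ls -l /etc/ood/config/apps/jupyter/slurm.yml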

This page has to issue squeue commands. I wonder if that’s the bit that’s slow?

Doesn’t seem to be:

[root@ondemand cloud-user]# time squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              5269 interacti ood-jupy    someuser R      56:19      1 interactive-0

real    0m0.019s

You can search /var/log/ondemand-nginx/$USER/error.log for execve to see the actual command we issue. It’s not simply squeue, but a formatted version of it.
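For example (a sketch; $USER here is whichever user’s PUN is hanging, as in the log paths above):

# Find the exact, fully formatted squeue invocation the dashboard runs,
# then re-run that under time rather than a bare squeue.
grep execve /var/log/ondemand-nginx/$USER/error.log | grep squeue | tail -n 5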

That’s the only thing that comes to mind. We issue that command to build the page; most other things should be in memory. I’d take a glance at what commands you’re having to issue to build that page.
