Debugging OOD hanging

Hi. I’ve got a Rocky Linux 8 server running OOD v2.0.29-1. The web UI hangs when I try to run interactive jobs, with the syslog showing things like:

[Wed Jul 12 22:47:15.408153 2023] [proxy:error] [pid 2711:tid 139867002492672] [client 10.29.107.91:60688] AH00898: Error reading from remote server returned by /pun/sys/dashboard/, referer: https://<FQDN>/pun/sys/dashboard/batch_connect/sys/jupyter/slurm/session_contexts
[Wed Jul 12 22:47:15.640963 2023] [lua:info] [pid 2711:tid 139867002492672] [client 10.29.107.91:60688] req_port="443" local_user="<USER>" req_is_https="true" log_hook="ood" req_user_ip="<IP>" req_referer="https://<FQDN>/pun/sys/dashboard/batch_connect/sys/jupyter/slurm/session_contexts" res_content_encoding="" req_accept_encoding="gzip, deflate, br" req_content_type="" req_accept_language="en-us,en;q=0.5" req_is_websocket="false" log_id="ZK6Ed1ez1e1jpTqKTMlT0AAAAAw" time_proxy="60293.316" req_hostname="<FQDN>" req_status="502" req_handler="proxy-server" log_time="2023-07-12T10:47:15.640883.0Z" res_location="" res_content_language="" req_accept="text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8" req_accept_charset="" time_user_map="0.001" req_user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" req_protocol="HTTP/1.1" res_content_length="" res_content_disp="" req_method="GET" res_content_type="" req_filename="proxy:http://localhost/pun/sys/dashboard/" res_content_location="" req_uri="/pun/sys/dashboard/" remote_user="<USER>" req_origin="" req_cache_control="" req_server_name="<FQDN>", referer: https://<FQDN>/pun/sys/dashboard/batch_connect/sys/jupyter/slurm/session_contexts

Any hints on debugging this, please?

It seems like there’s some issue with this jupyter/slurm app. Do you have a lot of ERB processing in it? Can you share it? My guess is the app is doing something that hangs.
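If it helps, a rough way to check for slow ERB is to look for anything in the app’s templates that shells out, and to time a render outside the dashboard. This is only a sketch - the /var/www/ood/apps/sys/jupyter path and the form.yml.erb/submit.yml.erb file names are assumptions about a standard deployment, and a standalone render won’t have the dashboard’s helpers available:

# Look for anything in the ERB templates that shells out - an external
# command run at render time is the usual cause of slow rendering.
cd /var/www/ood/apps/sys/jupyter
grep -nE '`|%x\(|system\(' form.yml.erb submit.yml.erb

# Crude standalone timing of the form template; a slow external command
# embedded in the ERB will still show up here.
time ruby -rerb -e 'puts ERB.new(File.read("form.yml.erb")).result'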

Thank you. The jupyter app is just the OSC bc_example_jupyter. The config I’m using for it is shown here. I’ve deployed this several times before on different clusters and haven’t run into a similar problem.

The web GUI is now showing:

We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.

although I’d be very surprised if anyone else is using it (the monitoring is proxied through this server too, so unfortunately I can’t check that way!).

OK - Check your /var/log/ondemand-nginx/$USER/error.log - I’ll bet it says something like what’s in this ticket.

We never quite figured out why it happens, but it would seem that some configuration/ERB/something is taking quite a long time to process.

The token in the URL suggests it’s a sub-app - i.e., sys/jupyter/slurm - and that extra slurm tacked on means it’s likely trying to read the file /etc/ood/config/apps/jupyter/slurm.yml. Does that file exist?

Hmm, you are right:

[root@ondemand cloud-user]# grep -r "Request queue full" /var/log/ondemand-nginx/*/
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 00:07:47.4147 18086/Tk age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 7-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:15:03.0711 18086/Tm age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 8-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)
/var/log/ondemand-nginx/someuser/error.log:[ W 2023-07-13 01:18:29.8312 18086/To age/Cor/Con/CheckoutSession.cpp:266 ]: [Client 9-4] Returning HTTP 503 due to: Request queue full (configured max. size: 100)

The OOD dashboard recovered, but trying to launch another Jupyter notebook made it hang again. However, looking at the ondemand/data/sys/dashboard/batch_connect/sys/jupyter/slurm/output/07ec87c8-0d11-4057-8a23-ba430afedfee/output.log file, the job appears to have launched fine, and squeue shows the job running. The CPU looks almost idle, there’s plenty of free memory, and no swap in use.

Re. the path /etc/ood/config/apps/jupyter/slurm.yml: that does exist. I’d assumed the slurm part came from the cluster name, which is “slurm”; in my Open OnDemand Ansible configuration, openondemand_clusters (the ood-ansible clusters role variable) defines it. That bit of the configuration is definitely the same as in other deployments.
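As a sanity check (these are just the paths already mentioned in this thread, plus the standard clusters.d location), it’s worth keeping the two files straight - the cluster definition that ood-ansible should have rendered from openondemand_clusters versus the per-cluster app config the sub-app reads:

# Cluster definition: the basename of this file is the cluster name ("slurm").
ls -l /etc/ood/config/clusters.d/slurm.yml
# Per-cluster app config that the sys/jupyter/slurm sub-app appears to be reading.
ls -l /etc/ood/config/apps/jupyter/slurm.yml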

This page has to issue squeue commands. I wonder if that’s the bit that’s slow?

Doesn’t seem to be:

[root@ondemand cloud-user]# time squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              5269 interacti ood-jupy    someuser R      56:19      1 interactive-0

real    0m0.019s

You can search /var/log/ondemand-nginx/$USER/error.log for execve to see the actual command we issue. It’s not simply squeue, but a formatted version of it.
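For example (a sketch; $USER here is whichever user’s PUN is hanging, as in the log paths above):

# Find the exact, fully formatted squeue invocation the dashboard runs,
# then re-run that under time rather than a bare squeue.
grep execve /var/log/ondemand-nginx/$USER/error.log | grep squeue | tail -n 5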

That’s the only thing that comes to mind. We issue that command to build the page; most other things should be in memory. I’d take a glance at what commands you’re having to issue to build that page.
