Frequency of SLURM calls from OOD

We have seen sporadic OOD login timeouts right after a user authenticates, which correspond to a log message like:

[Tue Mar 12 12:41:43.274829 2024] [proxy_http:error] [pid 185398:tid 139664266614528] (70007)The timeout specified has expired: [client 155.101.16.32:45342] AH01102: error reading status line from remote server httpd-UDS:0, referer: https://ondemand-class.chpc.utah.edu/pun/sys/dashboard/batch_connect/sessions

These messages show up much more often than the login timeouts, which are quite rare, and we have tracked them to a timed-out squeue call, which happens roughly every 5-10 seconds for each logged-in user. I guess the PUN is running squeue periodically to query the state of the user’s jobs and update their status in OOD.

What is the default interval of these queries and can it be changed? I would like to experiment with making it longer to see if we continue seeing these timeout log messages.

Thanks,
Martin

You can set the environment variable POLL_DELAY in the env file. This is the time in milliseconds that it waits before querying squeue (or the equivalent for your scheduler) again. Its default is 10000 (10 seconds).
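As a rough sketch, assuming the env file in question is the dashboard app's one (commonly /etc/ood/config/apps/dashboard/env, but check your install), raising the interval to 30 seconds would look something like this. The 30000 value is only an example to experiment with, not a recommendation.

```shell
# Assumed location: /etc/ood/config/apps/dashboard/env (verify on your system)
# Delay in milliseconds between squeue polls; the default is 10000 (10 seconds).
POLL_DELAY=30000
```

The new value should apply the next time a user's PUN starts, so users with an already running PUN may need to restart their web server from the dashboard to pick it up.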

I just filed a ticket to make this a proper configuration option, so you should see it as an actual configuration setting in the next release.

Thanks, Jeff. Is there also a timeout for how long OOD waits for the SLURM command to return? Perhaps we should increase that to get rid of errors like the one I posted earlier?

No. I seem to recall that Slurm has its own timeout of 60 seconds to complete the command, though I’m now unable to find the documentation for it. That’s the same length as the Apache request timeout.

If you’re able to extend the timeout on the Slurm side, you’d likely have to extend the timeout in Apache too.