Frequency of SLURM calls from OOD

mcuma · March 12, 2024, 7:38pm

We have seen sporadic OOD login timeouts after one authenticates, which correspond to a log message like:

[Tue Mar 12 12:41:43.274829 2024] [proxy_http:error] [pid 185398:tid 139664266614528] (70007)The timeout specified has expired: [client 155.101.16.32:45342] AH01102: error reading status line from remote server httpd-UDS:0, referer: https://ondemand-class.chpc.utah.edu/pun/sys/dashboard/batch_connect/sessions

These messages show up much more often than the login timeouts, which are quite rare, and we have tracked them to a timed-out squeue call that happens roughly every 5-10 seconds for each logged-in user. I guess the PUN is running squeue periodically to query the state of each user’s jobs and update their status in OOD.

What is the default interval of these queries and can it be changed? I would like to experiment with making it longer to see if we continue seeing these timeout log messages.

Thanks,
Martin
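
One way to check whether slow squeue responses line up with the log messages is to time the command directly on the OOD host. The following is only a rough sketch and is not part of OOD: the function names, the 10-second interval, and the 60-second timeout are assumptions chosen for illustration, and the user name in the usage comment is a placeholder.

```python
import subprocess
import time


def time_command(cmd, timeout_s):
    """Run cmd once and return (elapsed_seconds, timed_out)."""
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        timed_out = True
    return time.monotonic() - start, timed_out


def sample(cmd, runs, interval_s, timeout_s):
    """Run cmd `runs` times, sleeping interval_s seconds between runs."""
    results = []
    for i in range(runs):
        results.append(time_command(cmd, timeout_s))
        if i < runs - 1:
            time.sleep(interval_s)
    return results


# Example usage (placeholder user name; run where squeue is installed):
# for elapsed, timed_out in sample(["squeue", "-h", "-u", "someuser"],
#                                  runs=5, interval_s=10, timeout_s=60):
#     print(f"{elapsed:.2f}s timed_out={timed_out}")
```

Comparing the timestamps of slow or timed-out runs with the proxy_http errors in the Apache log may help confirm whether the scheduler response time is the cause.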

jeff.ohrstrom · March 13, 2024, 1:21pm

You can set the environment variable POLL_DELAY in the env file. This is the time in milliseconds it’ll wait between queries to squeue or similar. Its default is 10000 (10 seconds).

I just filed a ticket to make this a proper configuration, so you should see the actual configuration in the next release.
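
To get a feel for how POLL_DELAY affects scheduler load, here is a back-of-the-envelope sketch. It assumes, as a simplification, that each logged-in user triggers one squeue call per poll interval; the function name and the user count are made up for illustration, and real traffic depends on which pages users actually have open.

```python
def squeue_calls_per_minute(logged_in_users, poll_delay_ms):
    """Estimate scheduler queries per minute, assuming one call per user per poll."""
    if poll_delay_ms <= 0:
        raise ValueError("poll_delay_ms must be positive")
    return logged_in_users * 60_000 / poll_delay_ms


# Hypothetical site with 50 concurrent users, at the default and two longer delays.
for delay_ms in (10_000, 30_000, 60_000):
    rate = squeue_calls_per_minute(50, delay_ms)
    print(f"POLL_DELAY={delay_ms}: about {rate:.0f} squeue calls per minute")
```

Under these assumptions, raising the delay from 10000 to 30000 (a line such as POLL_DELAY=30000 in the env file) cuts the query rate to a third, at the cost of job status updating less often in the dashboard.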

mcuma · March 14, 2024, 7:27pm

Thanks Jeff. Is there also a timeout for how long OOD waits for the SLURM command to return? Perhaps we should increase that to get rid of errors like the one I posted earlier?

jeff.ohrstrom · March 14, 2024, 7:32pm

No. I seem to recall Slurm has its own timeout of 60 seconds to complete the command, though I’m now unable to find the documentation for it. That’s the same length as the Apache request timeout.

If you’re able to extend the timeout on the Slurm side, you’d likely have to extend the timeout in Apache too.

system · September 10, 2024, 7:32pm

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.
