Timeout with slurm controller

kenny.hanson · August 15, 2022, 4:58pm

Hi folks,
I got Jupyter and RStudio working on our OOD, but almost every time I try to launch a session (of either package) I get the following:

sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to tempest-slurm1:6819: Connection timed out
sbatch: error: slurmdbd: Sending PersistInit msg: Connection timed out
sbatch: error: slurmdbd: Sending PersistInit msg: Connection timed out
sbatch: error: slurmdbd: DBD_GET_CLUSTERS failure: Connection timed out
sbatch: error: Problem talking to database
sbatch: error: There is a problem talking to the database: Connection timed out. Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.

Is there a way to increase the OOD timeout value for reaching the controller? OOD and the tempest-slurm1 controller are on the same subnet, but both are VMs. I’m playing with the ARP cache timeout values, but if I can just get OOD to wait about 3-5 more seconds, it would alleviate this error until our new high-speed VM enclosure environment is deployed.

Kenny
kenny.hanson@montana.edu

gbyrket · August 15, 2022, 6:29pm

Hi Kenny.

Thanks for the post. I am currently researching this question as I do not know off the top of my head.

Thanks,
-gerald

gbyrket · August 15, 2022, 7:07pm

Hi Kenny.

After looking into this, I believe the issue is a configuration setting within Slurm, specifically slurmdbd. I’m not an expert with Slurm, but it looks like the timeout happens when attempting to connect to slurmdbd.

It looks like the default timeout value for slurmdbd is 10 seconds. Can you raise that value and see if you get any relief?

Here’s the slurmdbd.conf reference site: Slurm Workload Manager - slurmdbd.conf

Thanks,
-gerald

gbyrket · August 15, 2022, 8:30pm

Hi Kenny.

One more update. In the submit.yml.erb script for the app, please make sure you do not have ‘-M’ as a slurm argument. If you do, please remove it and let’s see if that helps.
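For reference, a rough sketch of what a submit.yml.erb without the cluster flag might look like. The `template` value and the `--nodes=1` argument are placeholders for illustration, not taken from Kenny's app; the point is only that nothing under `script.native` passes `-M` (or its long form `--clusters`), since that flag makes sbatch look the cluster up through slurmdbd.

```yaml
# submit.yml.erb (illustrative sketch; values below are assumptions)
---
batch_connect:
  template: "basic"
script:
  native:
    # Extra sbatch arguments go here, one list item each.
    # No "-M <cluster>" / "--clusters=<cluster>" entry is present.
    - "--nodes=1"
```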

Thanks,
-gerald

kenny.hanson · August 15, 2022, 8:41pm

Hey Gerald,
Thanks so much. Nailed it. In slurm.conf the TCPTimeout default is 2 seconds, and setting TCPTimeout=10 fixed it. That might be a bit long, but it gives it plenty of time. I’ve launched jobs every few minutes (to allow the ARP cache to invalidate) and it hasn’t failed yet. Thanks!!
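For anyone landing here later, the change amounts to one line in slurm.conf. This is a minimal fragment, not a full config; the value 10 is simply what worked here and may be more than you need.

```
# slurm.conf (fragment)
# TCPTimeout: seconds allowed for a TCP connection to be established.
# The default is 2 when the option is not set.
TCPTimeout=10
```

Since the error comes from sbatch itself, the slurm.conf that sbatch reads on the OOD host is presumably the copy that matters for this symptom. Slurm generally expects the file to be identical across the cluster, though, so it is safer to change it everywhere and have the daemons re-read it.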

Kenny

gbyrket · August 15, 2022, 8:42pm

That’s great news. Thanks for letting us know.

kenny.hanson · August 15, 2022, 8:43pm

This is my first time setting up a symphony of systems of this magnitude. I’m happier than a tornado in a trailer park.
Kenny

