Timeout with slurm controller

Hi folks,
I got Jupyter and RStudio working on our OOD, but almost every time I try to launch a session (of either package) I get the following:

sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to tempest-slurm1:6819: Connection timed out
sbatch: error: slurmdbd: Sending PersistInit msg: Connection timed out
sbatch: error: slurmdbd: Sending PersistInit msg: Connection timed out
sbatch: error: slurmdbd: DBD_GET_CLUSTERS failure: Connection timed out
sbatch: error: Problem talking to database
sbatch: error: There is a problem talking to the database: Connection timed out. Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.

Is there a way to increase the OOD timeout value for reaching the controller? Both OOD and the tempest-slurm1 controller are on the same subnet, but both are VMs. I’m playing with the ARP cache timeout values, but if I can just get OOD to wait about 3-5 more seconds, it would alleviate this error until our new high-speed VM enclosure environment is deployed.

Kenny
kenny.hanson@montana.edu

Hi Kenny.

Thanks for the post. I am currently researching this question as I do not know off the top of my head.

Thanks,
-gerald

Hi Kenny.

After looking into this, I believe the issue is a configuration within Slurm, specifically slurmdbd. I’m not an expert with Slurm, but it looks like the timeout is happening when attempting to connect to slurmdbd.

Looks like the default timeout value for slurmdbd is 10 seconds. Can you raise that value to see if you get any relief?

Here’s the slurmdbd.conf reference site: Slurm Workload Manager - slurmdbd.conf

Thanks,
-gerald

Hi Kenny.

One more update. In the submit.yml.erb script for the app, please make sure you do not have ‘-M’ as a slurm argument. If you do, please remove it and let’s see if that helps.

Thanks,
-gerald

Hey Gerald,
Thanks so much. Nailed it. In slurm.conf the TCPTimeout default is 2 seconds. Setting TCPTimeout=10 fixed it. That might be a bit long, but it gives it plenty of time. I’ve launched jobs every few minutes (to allow the ARP cache to invalidate) and it hasn’t failed yet. Thanks!!
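
For anyone hitting the same thing, here is a rough sketch of how one might time the TCP connect to slurmdbd from the OOD host, to see whether it really takes longer than the 2-second TCPTimeout default. The host and port come from the sbatch error earlier in the thread; the function name time_tcp_connect is just made up for illustration, and this only measures the TCP handshake, not anything Slurm does after it.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=10.0):
    """Return seconds taken to open a TCP connection, or None if it fails or times out."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return None


if __name__ == "__main__":
    # Host and port as seen in the sbatch error; adjust for your own site.
    elapsed = time_tcp_connect("tempest-slurm1", 6819)
    if elapsed is None:
        print("connect failed or timed out")
    else:
        print(f"connected in {elapsed:.2f}s")
```

If the first connect after an idle period regularly takes more than 2 seconds and later ones are fast, that would be consistent with the ARP cache theory.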

Kenny

That’s great news. Thanks for letting us know.

My first time setting up a symphony of systems of this magnitude. I’m happier than a tornado in a trailer park :stuck_out_tongue:
Kenny
