Hi folks,
I got Jupyter and RStudio working on our OOD instance, but almost every time I try to launch a session (of either app) I get the following:
sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to tempest-slurm1:6819: Connection timed out
sbatch: error: slurmdbd: Sending PersistInit msg: Connection timed out
sbatch: error: slurmdbd: Sending PersistInit msg: Connection timed out
sbatch: error: slurmdbd: DBD_GET_CLUSTERS failure: Connection timed out
sbatch: error: Problem talking to database
sbatch: error: There is a problem talking to the database: Connection timed out. Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.
Is there a way to increase the OOD timeout value for reaching the controller? Both OOD and the tempest-slurm1 controller are on the same subnet, but both are VMs. I’m playing with the ARP cache timeout values, but if I can just get OOD to wait about 3-5 more seconds, it would alleviate this error until our new high-speed VM enclosure environment is deployed.
After looking into this, I believe the issue is a configuration setting within Slurm, specifically slurmdbd. I’m not an expert with Slurm, but it looks like the timeout is happening when attempting to connect to slurmdbd.
It looks like the default timeout value for slurmdbd is 10 seconds. Can you increase that value to see if you get any relief?
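To see how close the connection is to the timeout, it may help to time the TCP connect to the slurmdbd port from the OOD host. Below is a minimal Python sketch; the function name time_connect is made up for illustration, and the host and port in the usage note are simply the ones from the sbatch error above. It only measures the TCP handshake, not the Slurm protocol exchange that follows.

```python
import socket
import time


def time_connect(host, port, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and name-resolution errors.
        return None
```

Calling time_connect("tempest-slurm1", 6819) once right after a quiet period and again a few seconds later should show whether only the first connection is slow, which would be consistent with the ARP cache suspicion.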
One more update: in the submit.yml.erb file for the app, please make sure you do not have ‘-M’ as a Slurm argument. If you do, please remove it and let’s see if that helps.
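If there are several apps to check, a quick scan can flag any submit.yml.erb that passes a cluster option. This is only a sketch: find_cluster_flags is a hypothetical helper, and the pattern assumes the flag is written as -M, --cluster, or --clusters somewhere on a line. It does plain text matching and does not render the ERB template.

```python
import re

# Matches -M (with or without an attached value) and --cluster / --clusters.
CLUSTER_FLAG = re.compile(r"(?<![\w-])(?:-M|--clusters?)")


def find_cluster_flags(text):
    """Return (line_number, line) pairs that appear to pass a cluster flag."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if CLUSTER_FLAG.search(line):
            hits.append((number, line.strip()))
    return hits
```

Reading the file with open(path).read() and passing the text to find_cluster_flags gives the line numbers to look at; an empty list means nothing matched.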
Hey Gerald,
Thanks so much. Nailed it. In slurm.conf the TCPTimeout default is 2 seconds; setting TCPTimeout=10 fixed it. That might be a bit long, but it gives it plenty of time. I’ve launched jobs every few minutes (to allow the ARP cache to invalidate) and it hasn’t failed yet. Thanks!!
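For anyone who wants to confirm which value a given slurm.conf sets, here is a small Python sketch. The helper name tcp_timeout is made up for illustration, and the fallback of 2 seconds is the default mentioned above. It only reads the text handed to it, so it says nothing about what the running daemons have loaded; scontrol show config is the place to check that.

```python
import re


def tcp_timeout(conf_text, default=2):
    """Return the first TCPTimeout value (seconds) found in slurm.conf text."""
    for line in conf_text.splitlines():
        # Drop comments, which start with '#'.
        line = line.split("#", 1)[0]
        # Parameter names are matched case-insensitively in this sketch.
        match = re.search(r"\bTCPTimeout\s*=\s*(\d+)", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    # Nothing set explicitly, so fall back to the assumed default.
    return default
```

Passing the contents of slurm.conf to tcp_timeout returns 10 once the TCPTimeout=10 line is in place, and 2 when the parameter is absent or commented out.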