Jobs not showing up due to "Socket timed out" error

We have OOD version 2.0.31 installed on a cluster running Slurm. Sometimes the Slurm scheduler is overloaded, which results in an error when submitting interactive app jobs:
sbatch error: Batch job submission failed: Socket timed out on send/recv operation
However, the job does get scheduled with Slurm and shows up in the OOD “Active Jobs” section, but not in the “My Interactive Sessions” section, so users are unaware that it is running.

Hey, sorry for the trouble.

It seems like there might be a communication issue between Slurm and OOD when the scheduler is overloaded, so the easiest fix might be to extend the timeout with Slurm.

I’ll have to look into this more as I’m not sure off the top of my head where this happens or how to extend the network timeout to avoid this.
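
In the meantime, a possible stopgap is to check Slurm directly whenever sbatch reports this timeout, since the job may have been accepted anyway. Here is a minimal Python sketch of that check. It is not part of OOD: the helper names (is_ambiguous_timeout, parse_job_ids, find_jobs_by_name) are hypothetical, and it assumes squeue is on the PATH and that the job name is distinctive enough to identify the submission.

```python
import getpass
import subprocess
from typing import List, Optional

# Error text taken from the sbatch message reported in this thread.
TIMEOUT_MARKER = "Socket timed out on send/recv operation"


def is_ambiguous_timeout(stderr: str) -> bool:
    """Return True if sbatch failed in a way that may still have queued the job."""
    return TIMEOUT_MARKER in stderr


def parse_job_ids(squeue_stdout: str) -> List[str]:
    """Turn squeue output (one job ID per line) into a list of non-empty IDs."""
    return [line.strip() for line in squeue_stdout.splitlines() if line.strip()]


def find_jobs_by_name(job_name: str, user: Optional[str] = None) -> List[str]:
    """Ask Slurm for the IDs of the user's jobs with the given name.

    Assumption: the job name is known to the caller; this is a hypothetical
    helper, not something OOD provides.
    """
    user = user or getpass.getuser()
    result = subprocess.run(
        ["squeue", "--noheader", "--user", user, "--name", job_name, "--format", "%i"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_job_ids(result.stdout)
```

If is_ambiguous_timeout() is true for the sbatch error output and find_jobs_by_name() then returns one or more IDs, the job was queued despite the error, which matches what "Active Jobs" shows.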
