Hey, thx for the quick response!
I’m sharing a screenshot of the output from the portal.
We’ve already checked the Slurmctld and Slurmdbd logs, but didn’t find anything relevant to the issue. The Slurmctld logs do show that the job was submitted successfully, and it also appears in the job list when running squeue.
OK, I see it - the card in yellow tells you what’s wrong. I assume your cluster is using some form of submit_host to ssh into while issuing the sbatch command.
We rely heavily on parsing stderr and stdout for the job id and/or error output after issuing sbatch.
It appears that we did not anticipate this string being in standard out - “this string” being ssh’s message about adding a host to the list of known hosts.
From what I can tell, you’re submitting the job correctly, but we’re not able to parse the job id correctly because of that ssh known-hosts message. Seems like maybe you need to redirect ssh’s stdout & stderr while keeping sbatch’s stdout & stderr.
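For example, rather than redirecting, one way to keep that message out of the stream we parse (assuming a plain OpenSSH client, with somesubmithost standing in for your real submit host) is to quiet ssh itself while leaving sbatch’s output alone:

    # suppress ssh's own warnings/diagnostics (including the known-hosts notice)
    ssh -q somesubmithost sbatch --wrap="hostname"

    # or lower ssh's log level so only errors come through
    ssh -o LogLevel=ERROR somesubmithost sbatch --wrap="hostname"

Pre-populating known_hosts for that host (e.g. with ssh-keyscan) would also stop the message from appearing at all.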
Yes, that looks much better - what’s in the parentheses () should be the job id.
Not sure why you wouldn’t be able to contact the slurm controller, but you can issue squeue -j 191 manually on the same machine (or ssh somesubmithost squeue -j 191) to replicate.
It’s important that you issue a similar command on the exact same machine as OOD does to replicate and debug your Slurm controller error.
Also, there may be something in the Slurm controller’s own logs that indicates what the issue could be.
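If it helps, a rough way to find and watch that log on the controller host (the exact path depends on your setup; SlurmctldLogFile in slurm.conf is authoritative) is something like:

    # ask Slurm where the controller logs, then follow it
    scontrol show config | grep -i SlurmctldLogFile
    tail -f /var/log/slurm/slurmctld.log   # substitute the path reported above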
The Slurm controller appears to be running both on the OOD instance and the Head Node, and both seem to be functioning independently.
When we submit a job on the OOD instance using sbatch --wrap="hostname", the job doesn’t appear on the Login, Head, or Compute nodes — and the same happens in reverse.
We’ve also verified connectivity from the OOD instance to the Head Node on port 6819, where the Slurm controller is currently running.
PS: We resolved the “Unable to contact Slurm controller (connect failure)” issue by removing the additional Slurm instance running on the OOD server and updating the Slurm configuration to point to the Head Node.
You want to think of the OOD machine as a login node. Essentially you just want sbatch and squeue to work as they would through the CLI. It doesn’t need to be a controller node, just one that can submit jobs through sbatch and query jobs through squeue.
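For example (hostnames below are placeholders, and the exact parameter name depends on your Slurm version), the slurm.conf on the OOD host should point at the Head Node’s controller, and then a quick sanity check from the CLI looks like:

    # /etc/slurm/slurm.conf on the OOD host should name the Head Node, e.g.
    #   SlurmctldHost=head-node        (ControlMachine= on older Slurm releases)

    scontrol ping                      # should report the Head Node's slurmctld as UP
    sbatch --wrap="hostname"           # should print "Submitted batch job <jobid>"
    squeue -u $USER                    # the job should show up in the cluster's queue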
Yeah, we figured this out: the squeue command on the OOD instance wasn’t showing jobs from the ParallelCluster, just as you mentioned earlier. It was pointing to the Slurm controller running locally on the OOD instance instead of the one on the Head Node.
Thanks a lot for the support, really appreciate it!