Specifying multiple hosts with the submit_host option

Hi,

I have a cluster configuration file with submit_host set to a node, so that the OOD portal can ssh to it and submit SLURM commands.

However, if this node goes down for whatever reason, no new jobs can be submitted via OOD until it is fixed or I manually edit the file to point to a different host.

I want to make this more robust by specifying multiple nodes in the submit_host option, so that if the first node fails there are other nodes to fall back on for job submission. From the docs, it doesn’t look like submit_host accepts more than one host.

Does anyone know of a workaround for this?

Thanks,

James

I would use a DNS name that resolves to multiple machines. Of course, DNS resolution does not know whether a given machine is down, so ssh may still be routed to a host that is not responding.
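To illustrate the caveat about DNS not knowing which machines are up, here is a minimal Python sketch of a health check that could sit alongside the DNS approach. It is not an OOD feature. It resolves a name to all of its addresses and returns the first one that accepts a TCP connection on the ssh port. The hostname submit.example.org, the function names, and the 2-second timeout are all assumptions made up for the example.

```python
import socket


def resolve_all(hostname, port=22):
    """Return the unique addresses a name resolves to, in resolver order."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        addr = sockaddr[0]
        if addr not in addresses:
            addresses.append(addr)
    return addresses


def is_reachable(addr, port=22, timeout=2.0):
    """True if a TCP connection to addr:port succeeds within the timeout."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_first_reachable(candidates, port=22, timeout=2.0, probe=is_reachable):
    """Return the first candidate the probe accepts, or None if all fail."""
    for addr in candidates:
        if probe(addr, port, timeout):
            return addr
    return None


if __name__ == "__main__":
    # Hypothetical round-robin DNS name covering several login nodes.
    name = "submit.example.org"
    try:
        hosts = resolve_all(name)
    except socket.gaierror:
        hosts = []
    print(pick_first_reachable(hosts))
```

A check like this only confirms that something is listening on port 22, not that Slurm commands will work on that node. Whatever consumes the result (a cron job that updates the DNS record, for example) is left open here.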

Thanks for the quick response, Jeff! We’ll give that a go.