Hi,
I have a cluster configuration file with submit_host set to a single node, so that the OOD portal can ssh to it and submit SLURM commands.
However, if that node goes down for whatever reason, no new jobs can be submitted via OOD until it is fixed or I manually edit the file to point at a different host.
I want to make this more robust by specifying multiple nodes in the submit_host option, so that if the first node fails there are other nodes to fall back on for job submission. Looking at the docs, it doesn't look like I can do this.
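To make the behaviour I am after concrete, here is a rough Python sketch of the fallback logic. This is not something OOD provides as far as I can tell. The function names are hypothetical, and treating "port 22 accepts a TCP connection" as "the node is usable for submission" is only an assumption.

```python
import socket
from typing import Callable, Iterable, Optional


def is_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: an open SSH port is a good enough health check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def first_available(
    hosts: Iterable[str],
    probe: Callable[[str], bool] = is_reachable,
) -> Optional[str]:
    """Return the first host for which probe(host) is True, else None."""
    for host in hosts:
        if probe(host):
            return host
    return None
```

For example, first_available(["login1", "login2"]) (placeholder host names) would return the first of the two that accepts a connection, which is the host I would then want OOD to use for submission.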
Does anyone know of a workaround for this?
Thanks,
James