Slurm job submission errors

That did indeed work, and I did not need strict_host_checking: false.

For anyone else with this problem (submitting jobs through Slurm to a cluster when OOD is hosted on a different server from the cluster), here is what I believe to be a full fix.

  1. Copy the cluster’s slurm.conf file to the server hosting OOD.
    Note: Our slurm.conf had one line that referenced <servername>.cluster.
    In the copy on the OOD server we removed the .cluster part, because
    that domain name is local to our cluster and it caused some errors
    on the OOD server.

    In the cluster .yml file, the conf: value needs to refer to the location of slurm.conf
    on the OOD server

  2. In the cluster .yml file, make sure you define submit_host: under the job section.
    According to one of the comments in this thread from the OOD folks,
    if you don’t define it, you need Slurm running on the OOD server.

  3. On the OOD server, check /etc/ssh to see if a known_hosts file exists.
    a) If not, create one.
    b) Add the cluster’s public key to the known_hosts file on the OOD server.
    When I ran some of the rake tests I noticed some comments about
    ECDSA keys, so we manually copied and pasted our cluster’s public ECDSA
    key into the known_hosts file on the OOD server. On the cluster, the
    key file was also in the /etc/ssh directory.

    The format we used in known_hosts was <serverIP>,<serverhostname><space><publickey>
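
For steps 1 and 2, the relevant part of the cluster .yml might look roughly like the sketch below. This is not our actual file: the file name, title, host names, and the slurm.conf path are all placeholders, so substitute your own values. The point is only to show where conf: and submit_host: sit under the job section.

```yaml
# /etc/ood/config/clusters.d/my_cluster.yml  (placeholder file name)
---
v2:
  metadata:
    title: "My Cluster"                # placeholder display name
  login:
    host: "login.example.edu"          # placeholder login node
  job:
    adapter: "slurm"
    # Step 1: path to the copied slurm.conf ON THE OOD SERVER (assumed path)
    conf: "/etc/slurm/slurm.conf"
    # Step 2: cluster host that OOD reaches over ssh to submit jobs (placeholder)
    submit_host: "login.example.edu"
    # strict_host_checking is left unset here; once the known_hosts entry
    # from step 3 was in place, we did not need to set it to false.
```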