OnDemand Dashboard not using SSH to run slurm commands (I think)

I’ve recently updated our OOD configuration and the infrastructure-as-code that sets everything up in AWS, and now I’m seeing a problem with interactive apps. When I start an interactive session, the info card consistently shows the session in an undetermined state. The per-user nginx log shows an error from the squeue command the dashboard uses to look up the job state, and it’s the same error I get if I run squeue against the cluster directly from the node running OOD. The problem is that the node running OOD isn’t supposed to be where the Slurm commands run (squeue and the other Slurm binaries being present there is vestigial and due to be cleaned up); the commands are supposed to run via SSH on a login node.

The thing is that the actual job runs as expected: it’s queued up from the login node, and the output log indicates that it’s running normally. It’s just that the dashboard isn’t picking up the job status. I suspect something in the dashboard app is misconfigured, but I don’t know where to look. Any pointers would be much appreciated.

Can you share your cluster.d file? I suspect you don’t have a submit_host parameter in it. The submit_host parameter is what tells OOD to ssh to another host to issue the Slurm commands.
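
Roughly speaking, once submit_host is set the adapter stops exec’ing the Slurm binaries locally and instead wraps each command in an ssh call to that host. A sketch of the idea (this is not the actual ood_core code, just an illustration of the wrapping):

# Sketch only: with submit_host set, each Slurm command is run over ssh on that
# host instead of being exec'd on the OOD node. ood_core's real option handling
# differs in detail.
def wrap_in_ssh(slurm_cmd, args, submit_host):
  return [
    "ssh", "-p", "22",
    "-o", "BatchMode=yes",  # never prompt for a password
    submit_host,
    slurm_cmd, *args,
  ]

# e.g. wrap_in_ssh("/bin/squeue", ["--all", "--states=all"], "<submit host>")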

Sure, here it is:

---
v2:
  metadata:
    title: "pc-stage"
    hidden: false
  login:
    host: "<address of our login node load balancer>"
  job:
    adapter: "slurm"
    cluster: "pc-stage"
    bin: "/bin"
    bin_overrides:
      sbatch: "/etc/ood/config/bin_overrides.py"

That doesn’t include a submit_host, as you suspected. I tried adding it under job, like so:

---
v2:
  metadata:
    title: "pc-stage"
    hidden: false
  login:
    host: "<address of our login node load balancer>"
  job:
    adapter: "slurm"
    cluster: "pc-stage"
    bin: "/bin"
    bin_overrides:
      sbatch: "/etc/ood/config/bin_overrides.py"
    submit_host: "<address of our login node load balancer>"

That didn’t change the behavior, even after hitting the “Restart web server” option in the dashboard. Maybe I need to do something else to get it to pick up the change?

Oh, adding the submit_host did change what shows up in the logs. The execve command now includes an ssh command, and that’s running into problems. That’s still an issue, but it’s a different issue, which is progress. I’ll dig into it and update here; hopefully the submit_host parameter plus some SSH fixes will resolve things.

OK, yea. So squeue is not working, but sbatch is, likely because of the bin_override. Is there a reason you didn’t supply a bin_override for squeue too?

That’s a great question, and not one that I knew to ask. We’ve been using a sample repository from AWS, and so far we haven’t had to touch this part of the configuration. But it does seem odd that it only overrides one of the Slurm commands. Here’s the bin_overrides.py that we’re working with:

#!/bin/python3
from getpass import getuser
from select import select
from sh import ssh, ErrorReturnCode
import logging
import os
import re
import sys
import yaml

'''
An example of a "bin_overrides" replacing Slurm "sbatch" for use with Open OnDemand.
Executes sbatch on the target cluster vs OOD node to get around painful experiences with sbatch + EFA.

Requirements:

- $USER must be able to SSH from web node to submit node without using a password
'''
logging.basicConfig(filename='/var/log/sbatch.log', level=logging.INFO)

USER = os.environ['USER']

def run_remote_sbatch(script, host_name, *argv):
  """
  @brief      SSH and submit the job from the submission node

  @param      script (str)  The script
  @param      host_name (str) The hostname of the submission node on which to execute the script
  @param      argv (list<str>)    The argument vector for sbatch

  @return     output (str) The merged stdout/stderr of the remote sbatch call
  """

  output = None

  try:
    result = ssh(
      '@'.join([USER, host_name]),
      '-oBatchMode=yes',  # ensure that SSH does not hang waiting for a password that will never be sent
      '-oUserKnownHostsFile=/dev/null', # do not read or record host keys in a known_hosts file
      '-oStrictHostKeyChecking=no',
      '/opt/slurm/bin/sbatch',  # the real sbatch on the remote
      *argv,  # any arguments that sbatch should get
      _in=script,  # redirect the script's contents into stdin
      _err_to_out=True  # merge stdout and stderr
    )

    output = result
    logging.info(output)
  except ErrorReturnCode as e:
    output = e
    logging.error(output)
    print(output)
    sys.exit(e.exit_code)

  return output

def load_script():
  """
  @brief      Loads a script from stdin.

  With OOD and Slurm the user's script is read from disk and passed to sbatch via stdin
  https://github.com/OSC/ood_core/blob/5b4d93636e0968be920cf409252292d674cc951d/lib/ood_core/job/adapters/slurm.rb#L138-L148

  @return     script (str) The script content
  """
  # Do not hang waiting for stdin that is not coming
  if not select([sys.stdin], [], [], 0.0)[0]:
    logging.error('No script available on stdin!')
    sys.exit(1)

  return sys.stdin.read()

def get_cluster_host(cluster_name):
  with open(f"/etc/ood/config/clusters.d/{cluster_name}.yml", "r") as stream:
    try:
      config_file = yaml.safe_load(stream)
    except yaml.YAMLError as e:
      # bail out here; otherwise config_file would be undefined below
      logging.error(e)
      sys.exit(1)
  return config_file["v2"]["login"]["host"]

def main():
  """
  @brief SSHs from web node to submit node and executes the remote sbatch.
  """
  # the cluster name arrives as the last argument (OOD appends "-M <cluster>")
  host_name = get_cluster_host(sys.argv[-1])
  output = run_remote_sbatch(
    load_script(),
    host_name,
    *sys.argv[1:]  # pass the sbatch arguments through individually
  )

  print(output)

if __name__ == '__main__':
  main()
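
As I understand it, OOD invokes this with the batch script on stdin and the sbatch arguments (ending with the cluster name) on the command line, so a hand-run check from the OOD node would look roughly like this (paths as in the script above, job details made up):

# Hypothetical manual invocation of the override, mirroring how OOD calls it:
# batch script on stdin, sbatch arguments (cluster name last) as argv.
import subprocess

script = "#!/bin/bash\n#SBATCH --job-name=ood-smoke-test\nhostname\n"
subprocess.run(
  ["/etc/ood/config/bin_overrides.py", "-M", "pc-stage"],
  input=script, text=True, check=True,
)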

The errors about SSH connections that I’m getting now are also weird:

App 350413 output: [2025-07-23 15:59:14 -0400 ]  INFO "execve = [{}, \"ssh\", \"-p\", \"22\", \"-o\", \"BatchMode=yes\", \"-o\", \"UserKnownHostsFile=/dev/null\", \"-o\", \"StrictHostKeyChecking=yes\", \"<login node load balancer address>\", \"/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"8\", \"-M\", \"pc-stage\"]"
App 350413 output: [2025-07-23 15:59:14 -0400 ] ERROR "No ED25519 host key is known for <login node load balancer address> and you have requested strict checking.\r\nHost key verification failed."

It looks like the command is setting UserKnownHostsFile to /dev/null while also setting StrictHostKeyChecking to yes. With an empty known-hosts file, strict checking can never find a matching host key, so that combination seems like it could never work, no?

Yea that’s strange. I don’t recall why we did that or if we even considered it after the fact.

There’s a config option for host checking (strict_host_checking), but no option for the known hosts file.

Seems like you need to set strict_host_checking to false.

Yeah, adding that (along with pointing bin at /opt/slurm/bin and removing the bin_overrides) did the trick. For posterity, the working cluster config now looks like this:

---
v2:
  metadata:
    title: "pc-stage"
    hidden: false
  login:
    host: "<login node load balancer address>"
  job:
    adapter: "slurm"
    cluster: "pc-stage"
    bin: "/opt/slurm/bin"
    submit_host: "<login node load balancer address>"
    strict_host_checking: false
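
And in case it helps anyone else, a quick way to sanity-check the SSH path the dashboard relies on is to run a Slurm command over ssh from the web node as your OOD user, roughly like this (the host is a placeholder, and the options mirror what the adapter uses with strict checking off):

# Rough sanity check from the OOD/web node (placeholder host): confirm Slurm
# commands can be run on the submit host non-interactively, as the adapter will.
import subprocess

subprocess.run(
  ["ssh", "-o", "BatchMode=yes", "-o", "StrictHostKeyChecking=no",
   "<login node load balancer address>", "/opt/slurm/bin/squeue", "--version"],
  check=True,
)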