That’s a great question, and not one that I knew to ask. We’ve been using a sample repository from AWS, and so far we haven’t had to touch this configuration. It does seem odd, though, that it only overrides one of the Slurm commands. Here’s the bin_overrides.py that we’re working with:
#!/bin/python3
from getpass import getuser
from select import select
from sh import ssh, ErrorReturnCode
import logging
import os
import re
import sys
import yaml

'''
An example of a "bin_overrides" replacing Slurm "sbatch" for use with Open OnDemand.
Executes sbatch on the target cluster vs the OOD node to get around painful experiences with sbatch + EFA.

Requirements:
- $USER must be able to SSH from the web node to the submit node without using a password
'''

logging.basicConfig(filename='/var/log/sbatch.log', level=logging.INFO)

USER = os.environ['USER']


def run_remote_sbatch(script, host_name, *argv):
    """
    @brief SSH and submit the job from the submission node

    @param script (str) The script
    @param host_name (str) The hostname of the head node on which to execute the script
    @param argv (list<str>) The argument vector for sbatch

    @return output (str) The merged stdout/stderr of the remote sbatch call
    """
    output = None

    try:
        result = ssh(
            '@'.join([USER, host_name]),
            '-oBatchMode=yes',  # ensure that SSH does not hang waiting for a password that will never be sent
            '-oUserKnownHostsFile=/dev/null',  # do not record or consult the known_hosts file
            '-oStrictHostKeyChecking=no',  # accept the remote host key without prompting
            '/opt/slurm/bin/sbatch',  # the real sbatch on the remote
            *argv,  # any arguments that sbatch should get
            _in=script,  # redirect the script's contents into stdin
            _err_to_out=True  # merge stdout and stderr
        )
        output = result
        logging.info(output)
    except ErrorReturnCode as e:
        output = e
        logging.error(output)
        print(output)
        sys.exit(e.exit_code)

    return output


def load_script():
    """
    @brief Loads a script from stdin.

    With OOD and Slurm the user's script is read from disk and passed to sbatch via stdin
    https://github.com/OSC/ood_core/blob/5b4d93636e0968be920cf409252292d674cc951d/lib/ood_core/job/adapters/slurm.rb#L138-L148

    @return script (str) The script content
    """
    # Do not hang waiting for stdin that is not coming
    if not select([sys.stdin], [], [], 0.0)[0]:
        logging.error('No script available on stdin!')
        sys.exit(1)

    return sys.stdin.read()


def get_cluster_host(cluster_name):
    with open(f"/etc/ood/config/clusters.d/{cluster_name}.yml", "r") as stream:
        try:
            config_file = yaml.safe_load(stream)
        except yaml.YAMLError as e:
            logging.error(e)
            sys.exit(1)  # bail out instead of hitting a NameError below
    return config_file["v2"]["login"]["host"]


def main():
    """
    @brief SSHs from the web node to the submit node and executes the remote sbatch.
    """
    host_name = get_cluster_host(sys.argv[-1])
    output = run_remote_sbatch(
        load_script(),
        host_name,
        *sys.argv[1:]  # unpack so sbatch gets individual arguments, not one list
    )
    print(output)


if __name__ == '__main__':
    main()
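As a side note on load_script: the select() call there is a zero-timeout readability probe, so the override exits immediately instead of blocking forever if it is ever launched without a script piped in. The same guard can be exercised in isolation with an anonymous pipe standing in for stdin (a sketch; the pipe and the sample script line are just stand-ins):

```python
import os
from select import select

r, w = os.pipe()  # the read end stands in for sys.stdin

# Nothing has been written yet, so a zero-timeout select reports "not ready";
# this is the case where load_script() logs an error and exits.
ready, _, _ = select([r], [], [], 0.0)
print(bool(ready))  # False

# Once a writer has sent the script, the same probe reports "ready".
os.write(w, b"#!/bin/bash\nsrun hostname\n")
ready, _, _ = select([r], [], [], 0.0)
print(bool(ready))  # True
```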
The errors about SSH connections that I’m getting now are also weird:
App 350413 output: [2025-07-23 15:59:14 -0400 ] INFO "execve = [{}, \"ssh\", \"-p\", \"22\", \"-o\", \"BatchMode=yes\", \"-o\", \"UserKnownHostsFile=/dev/null\", \"-o\", \"StrictHostKeyChecking=yes\", \"<login node load balancer address>\", \"/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"8\", \"-M\", \"pc-stage\"]"
App 350413 output: [2025-07-23 15:59:14 -0400 ] ERROR "No ED25519 host key is known for <login node load balancer address> and you have requested strict checking.\r\nHost key verification failed."
It looks like the command is both setting UserKnownHostsFile to /dev/null and setting StrictHostKeyChecking to yes. That seems like it would never work, no?
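For what it's worth, that combination looks self-defeating by construction: with the known-hosts file pointed at /dev/null there is never a stored key to verify against, and strict checking turns "unknown host" into a hard failure. A toy model of that decision (this is not OpenSSH's actual code; host_key_ok and the hostname are made up for illustration):

```python
def host_key_ok(known_hosts_path, host, strict):
    """Simplified model of OpenSSH's host-key decision."""
    try:
        with open(known_hosts_path) as f:
            # Is there any known_hosts entry whose first field names this host?
            known = any(line.split()[0] == host for line in f if line.strip())
    except FileNotFoundError:
        known = False
    if known:
        return True
    # An unknown host only passes when strict checking is off.
    return not strict

# /dev/null never contains a key, so strict checking can never succeed.
print(host_key_ok("/dev/null", "login.example.com", strict=True))   # False
print(host_key_ok("/dev/null", "login.example.com", strict=False))  # True
```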