Slurm job submission errors

I am trying to submit a Slurm job through OOD and I am getting the following error:

An error occurred when submitting jobs for simulation 1: sbatch: error: s_p_parse_file: cannot stat file /opt/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
sbatch: error: ClusterName needs to be specified
sbatch: fatal: Unable to process configuration file

I am using the following as documentation:
Setup SLURM

The OOD server is a separate server from the cluster.
The contents of my .yml file for the cluster on the OOD server are as follows:

---
v2:
  metadata:
    title: "F&M Research Cluster"
    url: "https://dorcfandm.github.io/rcs.github.io/"
  login:
    host: "rcs-scsn.fandm.edu"
  job:
    adapter: "slurm"
    cluster: "rcs-sc"
    bin: "/usr/bin"
    conf: "/opt/slurm/slurm.conf"

I also tried to test the setup via the command-line using:
su $USER -c 'scl enable ondemand -- bin/rake test:jobs:cluster1 RAILS_ENV=production'

with the following output:

 Rails Error: Unable to access log file. Please ensure that /var/www/ood/apps/sys/dashboard/log/production.log exists and is writable (ie, make it writable for user and group: chmod 0664 /var/www/ood/apps/sys/dashboard/log/production.log). The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
mkdir -p /home/user/test_jobs
Testing cluster 'cluster'...
Submitting job...
[2023-02-07 10:04:42 -0500 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/opt/slurm/slurm.conf\"}, \"/usr/bin/sbatch\", \"-D\", \"/home/user/test_jobs\", \"-J\", \"test_jobs_cluster\", \"-o\", \"/home/user/test_jobs/output_cluster_2023_02_07t10_04_42_05_00_log\", \"-t\", \"00:01:00\", \"--export\", \"NONE\", \"--parsable\", \"-M\", \"rcs-sc\"]"
rake aborted!
OodCore::JobAdapterError: sbatch: error: s_p_parse_file: cannot stat file /opt/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
sbatch: error: ClusterName needs to be specified
sbatch: fatal: Unable to process configuration file
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:477:in `rescue in submit'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:415:in `submit'
/var/www/ood/apps/sys/dashboard/lib/tasks/test.rake:30:in `block (4 levels) in <top (required)>'

Caused by:
OodCore::Job::Adapters::Slurm::Batch::Error: sbatch: error: s_p_parse_file: cannot stat file /opt/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
sbatch: error: ClusterName needs to be specified
sbatch: fatal: Unable to process configuration file
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:335:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:244:in `submit_string'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:475:in `submit'
/var/www/ood/apps/sys/dashboard/lib/tasks/test.rake:30:in `block (4 levels) in <top (required)>'
Tasks: TOP => test:jobs:cluster
(See full trace by running task with --trace)

/opt/slurm/slurm.conf is indeed the correct location for slurm.conf on our cluster, and the file permissions are set to world readable. There is only one cluster set up, and its name is rcs-sc.

Thank you in advance for any help

Hi and sorry for the trouble.

Looking at the log outputs, it looks like something is off with the cluster name you are submitting to, which needs to match the name given in SLURM.

What is the cluster name in the slurm.conf file?
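
Something like this on the cluster should show it (using the conf path from your config, so adjust if yours differs):

  grep -i '^ClusterName' /opt/slurm/slurm.conf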

On the cluster in slurm.conf the cluster name is:

################################################
#                   CONTROL                    #
################################################
ClusterName=rcs-sc

On my OOD server (which is not part of the cluster), I set cluster: "rcs-sc"

Do you have that slurm.conf file on the actual web node running OOD? It will need to be there.

No I don’t have it on the OOD webserver. It wasn’t clear from the documentation that it needed to be on the webserver. Does Slurm also need to be installed on the webserver too or do I just need to copy the slurm.conf from my cluster to the webserver?

You won’t need slurm on the web host.

I apologize that wasn’t clear in the docs. The way to think about it is Open OnDemand needs to know the configuration details of the cluster in order to submit jobs to it, and the cluster needs to know its own configuration details in order to manage the jobs that are submitted to it.

So, this is why you will need that slurm.conf in two places, but won’t need to do a full SLURM install on the web host with OOD.
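
For example, copying it over from the cluster would look roughly like this, run on the OOD web host (hostnames and paths here are just the ones from your config, so adjust as needed):

  sudo mkdir -p /opt/slurm
  scp rcs-scsn.fandm.edu:/opt/slurm/slurm.conf /tmp/slurm.conf
  sudo cp /tmp/slurm.conf /opt/slurm/slurm.conf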

Ok, sounds good. I will copy it over and close the ticket if that solves the issue

My apologies again, I can’t even understand our own docs on this either.

So, for the setup you currently have, you will need the SLURM binaries on the web host with OOD.

I’m really sorry for the confusion.

Now, there is a way around having that binary installed on the web host with OOD; it is not fully documented yet, though it works in 2.0.

If you look here:
https://osc.github.io/ood-documentation/develop/installation/resource-manager/slurm.html

You’ll see an option for submit_host that does allow you to skip having the binary on the web node with OOD and instruct the sbatch commands to be run on another host.

I will add the submit_host option to my cluster.yml and see if that helps

My current cluster.yml file is:

---
v2:
  metadata:
    title: "F&M Research Cluster"
    url: "https://dorcfandm.github.io/rcs.github.io/"
  login:
    host: "rcs-scsn.fandm.edu"
  job:
    adapter: "slurm"
    submit_host: "rcs-scsn.fandm.edu"
    bin: "/usr/bin"
    conf: "/opt/slurm/slurm.conf"

But now it errors:
An error occurred when submitting jobs for simulation 1: sbatch: error: get_addr_info: getaddrinfo() failed: Name or service not known
sbatch: error: slurm_set_addr: Unable to resolve "rcs-sc.cluster"
sbatch: error: slurm_get_port: Address family '0' not supported
sbatch: error: Error connecting, bad data: family = 0, port = 0
sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:rcs-sc.cluster:6819: No error
sbatch: error: Sending PersistInit msg: No error
sbatch: error: get_addr_info: getaddrinfo() failed: Name or service not known
sbatch: error: slurm_set_addr: Unable to resolve "rcs-sc.cluster"
sbatch: error: slurm_get_port: Address family '0' not supported
sbatch: error: Error connecting, bad data: family = 0, port = 0
sbatch: error: Sending PersistInit msg: No error
sbatch: error: DBD_GET_CLUSTERS failure: No error
sbatch: error: Problem talking to database
sbatch: error: 'rcs-sc' can't be reached now, or it is an invalid entry for --cluster. Use 'sacctmgr list clusters' to see available clusters.

  1. Do I need to rebuild the portal every time I change cluster.yml?

  2. I suspect the issue is with this line in our slurm.conf
    AccountingStorageHost=rcs-sc.cluster

That name (rcs-sc.cluster) resolves fine on the cluster (to a local 10. address), but it won't resolve from any server outside the cluster.

You shouldn't have to rebuild the portal after changing the cluster.yml, though restarting httpd isn't a bad idea just to be safe.

I'm not sure why that name resolution is failing, but the fact that nothing outside the cluster can resolve it is clearly the problem.

What happens if you set the cluster IP and hostname in /etc/hosts and try again? That will at least clue you in on whether it's a config problem or something with name resolution.
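
Something like this on the web host would do it (the 10.x address below is just a placeholder for your cluster's actual internal IP):

  # /etc/hosts entry on the OOD web host
  10.0.0.10   rcs-sc.cluster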

So I'm slowly moving in the right direction. I edited the slurm.conf on the OOD server
and removed the .cluster portion from that line, which got me closer to working. Now I'm getting an error about keys:

An error occurred when submitting jobs for simulation 1: No ECDSA host key is known for rcs-scsn.fandm.edu and you have requested strict checking.
Host key verification failed.

Through the OOD web portal I can start an SSH session to the cluster as an actual user. I also know that the username I am using has all the keys synced up on the cluster, and I can SSH directly from the command line on the OOD server as that user.

Is the request being made under a different user (like ood) and I need to add that user to our cluster?

I'm not sure. This may be a problem where something failed to sync.

Do you see an entry for the cluster in the known_hosts on the OOD web host?

If you manually add the public key for the cluster to the web node’s known_hosts, what happens if you try to submit then? It looks like it just needs that host and its key added. You can use ssh-keyscan with the cluster hostname to grab them.
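
Roughly something like this on the web host would grab and append it (the system-wide known_hosts location can vary by distro; the mapped user's ~/.ssh/known_hosts also works):

  ssh-keyscan -t ecdsa rcs-scsn.fandm.edu | sudo tee -a /etc/ssh/ssh_known_hosts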

I tried the following:

  1. From the command line on the OOD server, ran ssh-keygen -t ecdsa
  2. Used ssh-copy-id to copy the key to our cluster
  3. SSH'd to the cluster and verified I didn't need a password
  4. From the OOD web portal, started an SSH session to the cluster and verified it didn't ask for a password
  5. Tested a job submit and it still failed with the same issue

I dug around the OOD discourse pages and found that apparently there’s an option (one that I didn’t see in the documentation) you can add to the cluster.yml file
strict_host_checking:

I set this value to false and then I could submit jobs and also see jobs in the queue.
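
For reference, I put it under the job: section of the cluster .yml roughly like this (based on what I found on Discourse, so double-check it against the docs for your version; other keys omitted):

---
v2:
  job:
    adapter: "slurm"
    submit_host: "rcs-scsn.fandm.edu"
    strict_host_checking: false
    bin: "/usr/bin"
    conf: "/opt/slurm/slurm.conf"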

I also noticed that when running the rake test mentioned above it seems to be setting "UserKnownHostsFile=/dev/null", which I think is also a problem.
Is there a setting I can add in the configuration file to point the known_hosts file to something like /home/$USER/.ssh/known_hosts so that maybe I can turn strict checking back on? My concern is that turning off strict checking presents some security risks.

I don't see anything in the updates above about ensuring the cluster and its public key have been added to the web host's known_hosts file, though.

Could you see if the cluster has an entry in the web host’s known_hosts file to start?

That did indeed work and I did not need strict_host_checking: false.

For anyone else with this problem (trying to submit jobs through Slurm to a cluster when OOD is hosted on a different server from the cluster), here is what I believe to be the full fix.

  1. Copy the cluster's slurm.conf file to the server hosting OOD.
    Note: In our slurm.conf file we had one line that referenced <servername>.cluster.
    We modified the slurm.conf on the OOD server to remove the .cluster part
    because that domain name was local to our cluster and caused errors from
    outside it.

    In the cluster .yml file, the conf: value needs to point to the location of slurm.conf
    on the OOD server.

  2. In the cluster .yml file, make sure you define submit_host: under the job: section.
    According to one of the comments in this thread from the OOD folks,
    if you don't define that, then you need Slurm running on the OOD server.

  3. On the OOD server, check /etc/ssh to see if a known_hosts file exists.
    a) If not, create one.
    b) Add the cluster's public key to the known_hosts file on the OOD server.
    When I ran some of the rake tests I noticed some comments about
    ECDSA keys, so we manually copied/pasted our cluster's public ECDSA
    key into the OOD server's known_hosts. The cluster's key file was also in its
    /etc/ssh directory.

    The format we used in known_hosts was <serverIP>,<serverhostname><space><publickey>
    (see the example line below).
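
For example, an entry would look roughly like this (placeholder IP and a truncated key, just to show the shape of the line):

  10.0.0.10,rcs-scsn.fandm.edu ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTY...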
