Slurm job submission errors

I am trying to submit a Slurm job through OOD and I am getting the following error:

An error occurred when submitting jobs for simulation 1: sbatch: error: s_p_parse_file: cannot stat file /opt/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
sbatch: error: ClusterName needs to be specified
sbatch: fatal: Unable to process configuration file

I am using the following as documentation:
Setup SLURM

The OOD server is a separate server from the cluster.
The contents of my .yml file for the cluster on the OOD server are as follows:

---
v2:
  metadata:
    title: "F&M Research Cluster"
    url: "https://dorcfandm.github.io/rcs.github.io/"
  login:
    host: "rcs-scsn.fandm.edu"
  job:
    adapter: "slurm"
    cluster: "rcs-sc"
    bin: "/usr/bin"
    conf: "/opt/slurm/slurm.conf"

I also tried to test the setup via the command-line using:
su $USER -c 'scl enable ondemand -- bin/rake test:jobs:cluster1 RAILS_ENV=production'

with the following output:

 Rails Error: Unable to access log file. Please ensure that /var/www/ood/apps/sys/dashboard/log/production.log exists and is writable (ie, make it writable for user and group: chmod 0664 /var/www/ood/apps/sys/dashboard/log/production.log). The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
mkdir -p /home/user/test_jobs
Testing cluster 'cluster'...
Submitting job...
[2023-02-07 10:04:42 -0500 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/opt/slurm/slurm.conf\"}, \"/usr/bin/sbatch\", \"-D\", \"/home/user/test_jobs\", \"-J\", \"test_jobs_cluster\", \"-o\", \"/home/user/test_jobs/output_cluster_2023_02_07t10_04_42_05_00_log\", \"-t\", \"00:01:00\", \"--export\", \"NONE\", \"--parsable\", \"-M\", \"rcs-sc\"]"
rake aborted!
OodCore::JobAdapterError: sbatch: error: s_p_parse_file: cannot stat file /opt/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
sbatch: error: ClusterName needs to be specified
sbatch: fatal: Unable to process configuration file
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:477:in `rescue in submit'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:415:in `submit'
/var/www/ood/apps/sys/dashboard/lib/tasks/test.rake:30:in `block (4 levels) in <top (required)>'

Caused by:
OodCore::Job::Adapters::Slurm::Batch::Error: sbatch: error: s_p_parse_file: cannot stat file /opt/slurm/slurm.conf: No such file or directory, retrying in 1sec up to 60sec
sbatch: error: ClusterName needs to be specified
sbatch: fatal: Unable to process configuration file
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:335:in `call'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:244:in `submit_string'
/opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.29/gems/ood_core-0.22.0/lib/ood_core/job/adapters/slurm.rb:475:in `submit'
/var/www/ood/apps/sys/dashboard/lib/tasks/test.rake:30:in `block (4 levels) in <top (required)>'
Tasks: TOP => test:jobs:cluster
(See full trace by running task with --trace)

/opt/slurm/slurm.conf is indeed the correct location for slurm.conf on our cluster, and the file permissions are set to world readable. There is only one cluster set up, and its name is rcs-sc.

Thank you in advance for any help

Hi and sorry for the trouble.

Looking at the log outputs, it looks like something is off with the cluster name you are submitting to, which needs to match the name given in SLURM.

What is the cluster name in the slurm.conf file?
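
Something like this on the cluster should show it (using the conf path from your config, so adjust if yours differs):

  grep -i '^ClusterName' /opt/slurm/slurm.conf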

On the cluster in slurm.conf the cluster name is:

################################################
#                   CONTROL                    #
################################################
ClusterName=rcs-sc

On my OOD server (which is not part of the cluster), I set cluster: "rcs-sc"

Do you have that slurm.conf file on the actual web node running OOD? It will need to be there.

No I don’t have it on the OOD webserver. It wasn’t clear from the documentation that it needed to be on the webserver. Does Slurm also need to be installed on the webserver too or do I just need to copy the slurm.conf from my cluster to the webserver?

You won’t need slurm on the web host.

I apologize that wasn’t clear in the docs. The way to think about it is Open OnDemand needs to know the configuration details of the cluster in order to submit jobs to it, and the cluster needs to know its own configuration details in order to manage the jobs that are submitted to it.

So, this is why you will need that slurm.conf in two places, but won’t need to do a full SLURM install on the web host with OOD.
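
For example, copying it over from the cluster would look roughly like this, run on the OOD web host (hostnames and paths here are just the ones from your config, so adjust as needed):

  sudo mkdir -p /opt/slurm
  scp rcs-scsn.fandm.edu:/opt/slurm/slurm.conf /tmp/slurm.conf
  sudo cp /tmp/slurm.conf /opt/slurm/slurm.conf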

Ok, sounds good. I will copy it over and close the ticket if that solves the issue

My apologies again, I can’t even understand our own docs on this either.

So, for the setup you currently have, you will need the SLURM binaries on the web host with OOD.

I’m really sorry for the confusion.

Now, there is a way around having that binary installed on the web host with OOD; it is not fully documented yet, though it works in 2.0.

If you look here:
https://osc.github.io/ood-documentation/develop/installation/resource-manager/slurm.html

You’ll see an option for submit_host that does allow you to skip having the binary on the web node with OOD and instruct the sbatch commands to be run on another host.

I will add the submit_host option to my cluster.yml and see if that helps

My current cluster.yml file is:

---
v2:
  metadata:
    title: "F&M Research Cluster"
    url: "https://dorcfandm.github.io/rcs.github.io/"
  login:
    host: "rcs-scsn.fandm.edu"
  job:
    adapter: "slurm"
    submit_host: "rcs-scsn.fandm.edu"
    bin: "/usr/bin"
    conf: "/opt/slurm/slurm.conf"

But now it errors:
An error occurred when submitting jobs for simulation 1: sbatch: error: get_addr_info: getaddrinfo() failed: Name or service not known
sbatch: error: slurm_set_addr: Unable to resolve "rcs-sc.cluster"
sbatch: error: slurm_get_port: Address family '0' not supported
sbatch: error: Error connecting, bad data: family = 0, port = 0
sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:rcs-sc.cluster:6819: No error
sbatch: error: Sending PersistInit msg: No error
sbatch: error: get_addr_info: getaddrinfo() failed: Name or service not known
sbatch: error: slurm_set_addr: Unable to resolve "rcs-sc.cluster"
sbatch: error: slurm_get_port: Address family '0' not supported
sbatch: error: Error connecting, bad data: family = 0, port = 0
sbatch: error: Sending PersistInit msg: No error
sbatch: error: DBD_GET_CLUSTERS failure: No error
sbatch: error: Problem talking to database
sbatch: error: 'rcs-sc' can't be reached now, or it is an invalid entry for --cluster. Use 'sacctmgr list clusters' to see available clusters.

  1. Do I need to rebuild the portal every time I change cluster.yml?

  2. I suspect the issue is with this line in our slurm.conf
    AccountingStorageHost=rcs-sc.cluster

That name (rcs-sc.cluster) resolves fine on the cluster (to a local 10. address), but it won't resolve from any server outside the cluster.

You shouldn't have to rebuild the portal after changing the cluster.yml, though restarting httpd isn't a bad idea just to be safe.

I'm not sure why that name resolution is failing, but the fact that nothing outside the cluster can resolve it is clearly the problem.

What happens if you set the cluster IP and hostname in /etc/hosts and try again? That will at least clue you in on whether it's a config problem or something with name resolution.
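
Something like this on the web host would do it (the 10.x address below is just a placeholder for your cluster's actual internal IP):

  # /etc/hosts entry on the OOD web host
  10.0.0.10   rcs-sc.cluster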

So I'm slowly moving in the right direction. I edited the slurm.conf on the OOD server
and removed the .cluster portion from that line, which got me closer to working. Now I'm getting an error about keys:

An error occurred when submitting jobs for simulation 1: No ECDSA host key is known for rcs-scsn.fandm.edu and you have requested strict checking.
Host key verification failed.

Through the OOD web portal I can start an SSH session to the cluster as an actual user. I also know that the username I am using has all the keys synced up on the cluster, and I can SSH directly from the command line on the OOD server as that user.

Is the request being made under a different user (like ood) and I need to add that user to our cluster?

I'm not sure. This may be a problem where something failed to sync.

Do you see an entry for the cluster in the known_hosts on the OOD web host?

If you manually add the public key for the cluster to the web node’s known_hosts, what happens if you try to submit then? It looks like it just needs that host and its key added. You can use ssh-keyscan with the cluster hostname to grab them.
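
Roughly something like this on the web host would grab and append it (the system-wide known_hosts location can vary by distro; the mapped user's ~/.ssh/known_hosts also works):

  ssh-keyscan -t ecdsa rcs-scsn.fandm.edu | sudo tee -a /etc/ssh/ssh_known_hosts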

I tried the following:

  1. From the command line on the OOD server, ran ssh-keygen -t ecdsa
  2. Used ssh-copy-id to copy the key to our cluster
  3. SSH'd to the cluster and verified I didn't need a password
  4. From the OOD web portal, started an SSH session to the cluster and verified it didn't ask for a password
  5. Tested a job submit and it still failed with the same issue

I dug around the OOD discourse pages and found that apparently there’s an option (one that I didn’t see in the documentation) you can add to the cluster.yml file
strict_host_checking:

I set this value to false and then I could submit jobs and also see jobs in the queue.
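
For reference, I put it under the job: section of the cluster .yml roughly like this (based on what I found on Discourse, so double-check it against the docs for your version; other keys omitted):

---
v2:
  job:
    adapter: "slurm"
    submit_host: "rcs-scsn.fandm.edu"
    strict_host_checking: false
    bin: "/usr/bin"
    conf: "/opt/slurm/slurm.conf"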

I also noticed that when running the rake test mentioned above it seems to be setting "UserKnownHostsFile=/dev/null", which I think is also a problem.
Is there a setting I can add in the configuration file to point the known_hosts file to something like /home/$USER/.ssh/known_hosts so that maybe I can turn strict checking back on? My concern is that turning off strict checking presents some security risks.

I don't see anything in the updates above about ensuring the cluster and its public key have been added to the web host's known_hosts file, though.

Could you see if the cluster has an entry in the web host’s known_hosts file to start?

That did indeed work and I did not need strict_host_checking: false.

For anyone else with this problem (trying to submit jobs through Slurm to a cluster when OOD is hosted on a different server from the cluster), here is what I believe to be the full fix.

  1. Copy the cluster's slurm.conf file to the server hosting OOD.
    Note: In our slurm.conf file we had one line that referenced <servername>.cluster.
    We modified the slurm.conf on the OOD server to remove the .cluster part
    because that domain name was local to our cluster and caused errors from
    outside it.

    In the cluster .yml file, the conf: value needs to point to the location of slurm.conf
    on the OOD server.

  2. In the cluster .yml file, make sure you define submit_host: under the job: section.
    According to one of the comments in this thread from the OOD folks,
    if you don't define that, then you need Slurm running on the OOD server.

  3. On the OOD server, check /etc/ssh to see if a known_hosts file exists.
    a) If not, create one.
    b) Add the cluster's public key to the known_hosts file on the OOD server.
    When I ran some of the rake tests I noticed some comments about
    ECDSA keys, so we manually copied/pasted our cluster's public ECDSA
    key into the OOD server's known_hosts. The cluster's key file was also in its
    /etc/ssh directory.

    The format we used in known_hosts was <serverIP>,<serverhostname><space><publickey>
    (see the example line below).
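
For example, an entry would look roughly like this (placeholder IP and a truncated key, just to show the shape of the line):

  10.0.0.10,rcs-scsn.fandm.edu ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTY...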
