Issue Getting Jobs to Submit to the Slurm Cluster via OnDemand (RHEL 8.6, OnDemand 2.0.27)

Hey All,

So I am working on finalizing our OnDemand instance for our Slurm-based HPC cluster. Right now, every time I try to submit a job or launch an interactive app, I receive the following error:

sbatch: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Permission denied
sbatch: error: Sending PersistInit msg: Permission denied
sbatch: error: Sending PersistInit msg: Permission denied
sbatch: error: DBD_GET_CLUSTERS failure: Permission denied
sbatch: error: Problem talking to database
sbatch: error: There is a problem talking to the database: Permission denied.  Only local cluster communication is available, remove --cluster from your command line or contact your admin to resolve the problem.

I have everything configured based on the documentation. For example, here is the /etc/ood/config/clusters.d/link.yml:

v2:
  metadata:
    title: "Link (Bowser v2.0)"
  login:
    host: "link.phys.wvu.edu"
    default: true
  job:
    adapter: "slurm"
    cluster: "Korok"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin/:$PATH"
        export WEBSOCKIFY_CMD="/opt/websockify-0.10.0/run"
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"

I have confirmed that the reverse proxy is set up correctly and works fine (starting nc -l 5432 on node05.korok allows a connection when going to http://link.phys.wvu.edu/node/node05.korok/5432).

sbatch is working just fine on the system for users, so it's not a Slurm issue per se.

Any ideas on what might be going on here?

~ Joe G.

Following up on this: despite jobs showing in squeue for the Korok partition, I cannot see them at:
https://link.phys.wvu.edu/pun/sys/dashboard/activejobs?jobcluster=all&jobfilter=all

That seems to indicate that there is indeed a communication issue between Slurm and OnDemand. Is there a specific place I should be entering Slurm accounting information so it can access the accounting database that sacctmgr manages?

~ Joe G.

Because you’re using this setting

cluster: "Korok"

we’re issuing the -M Korok flag to squeue and sbatch. That is what Slurm seems to be complaining about when it tells you to remove --cluster from your command line. Maybe try removing that?

At this point you probably want to remove Open OnDemand from the equation and just troubleshoot from the command line on the same host. We’re issuing commands just like you would. Indeed, you can find the exact commands we issue in /var/log/ondemand-nginx/$USER/error.log.
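Each command shows up in that log on an "execve = [...]" line, so a simple grep pulls them out. Here's a sketch; the sample line below is abbreviated from output later in this thread so the snippet runs anywhere, but on the real system you'd just grep the log file itself:

```shell
# On the real host: grep execve /var/log/ondemand-nginx/$USER/error.log
# Self-contained demo on a sample line in the same format:
sample='[2022-07-08 10:14:55 -0400 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/sbatch\", \"-M\", \"korok\"]"'
printf '%s\n' "$sample" | grep 'execve'
```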

Issue sbatch or squeue commands from the CLI both with and without -M Korok to see what works.

Hey Jeff,

I noticed that Slurm reports Korok as korok, so I changed the casing in the cluster config (though I will note that -M Korok and -M korok give the same response; see below).
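For reference, the job section of /etc/ood/config/clusters.d/link.yml now reads (only the casing of the cluster name changed):

```yaml
  job:
    adapter: "slurm"
    cluster: "korok"   # lowercased to match the name slurmdbd reports
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
```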

(base) [jpg00017@link dashboard]$ /usr/bin/squeue --all --states=all -M Korok
CLUSTER: korok
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                52     Korok     wrap  prb0007  R    1:37:42      1 node04
(base) [jpg00017@link dashboard]$ /usr/bin/squeue --all --states=all -M korok
CLUSTER: korok
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                52     Korok     wrap  prb0007  R    1:36:33      1 node04

Running it with the formatting included (which is what is reported in the logs):

(base) [jpg00017@link dashboard]$ /usr/bin/squeue --all --states=all --noconvert -o \\u001E%a\\u001F%N\\u001F%Y\\u001F%j\\u001F%u\\u001F%P\\u001F%M\\u001F%A\\u001F%t -M korok
CLUSTER: korok
\u001EACCOUNT\u001FNODELIST\u001FSCHEDNODES\u001FNAME\u001FUSER\u001FPARTITION\u001FTIME\u001FJOBID\u001FST
\u001E(null)\u001Fnode04\u001F(null)\u001Fwrap\u001Fprb0007\u001FKorok\u001F1:40:06\u001F52\u001FR
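Side note on that format string: \u001E and \u001F are the ASCII record and unit separator control characters, which OnDemand appears to use as field delimiters (presumably so job names containing spaces don't break parsing). A self-contained sketch of splitting one such line, with sample values copied from the output above:

```shell
# \036 (record sep) and \037 (unit sep) are the raw bytes behind the
# \u001E / \u001F escapes in the squeue format string above.
line="$(printf '\036(null)\037node04\037(null)\037wrap\037prb0007\037Korok\0371:40:06\03752\037R')"
# Fields in order: ACCOUNT NODELIST SCHEDNODES NAME USER PARTITION TIME JOBID ST
printf '%s\n' "$line" | awk -F'\037' '{print "user=" $5, "jobid=" $8, "state=" $9}'
# → user=prb0007 jobid=52 state=R
```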

I also went through the testing suite for the cluster configuration, which gave:

(base) [jpg00017@link dashboard]$ sudo su $USER -c 'scl enable ondemand -- bin/rake test:jobs:link RAILS_ENV=production'
[sudo] password for jpg00017: 
Rails Error: Unable to access log file. Please ensure that /var/www/ood/apps/sys/dashboard/log/production.log exists and is writable (ie, make it writable for user and group: chmod 0664 /var/www/ood/apps/sys/dashboard/log/production.log). The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
mkdir -p /minish/jpg00017/test_jobs
Testing cluster 'link'...
Submitting job...
[2022-07-08 10:14:55 -0400 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/sbatch\", \"-D\", \"/minish/jpg00017/test_jobs\", \"-J\", \"test_jobs_link\", \"-o\", \"/minish/jpg00017/test_jobs/output_link_2022_07_08t10_14_55_04_00_log\", \"-t\", \"00:01:00\", \"--export\", \"NONE\", \"--parsable\", \"-M\", \"korok\"]"
Got job id '59'
[2022-07-08 10:14:55 -0400 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%A\\u001F%i\\u001F%t\", \"-j\", \"59\", \"-M\", \"korok\"]"
Job has status of queued
[2022-07-08 10:15:01 -0400 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%A\\u001F%i\\u001F%t\", \"-j\", \"59\", \"-M\", \"korok\"]"
Job has status of completed
Test for 'link' PASSED!
Finished testing cluster 'link'

So yeah, sbatch and squeue are having no issue at all here. There seems to be a communication error between OnDemand and Slurm (both are running on the same head node).

~ Joe G.

Are you issuing those test commands on the head node? If they work from the CLI on the same node, they should work from within OOD… Maybe you need to restart your PUN from the menu at the top right. Other than that… there’s something we’re missing.

Yep, all of the above commands are running on the head node which is where OnDemand and Slurm are situated.

Restarting the web-server does not seem to fix the issue.

Here is what /etc/slurm/slurm.conf looks like (I have removed everything that was commented out with #, i.e., left at the default settings):

#
# See the slurm.conf man page for more information.
#
ControlMachine=link
ControlAddr=10.1.0.253
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/d
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/none
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/linear
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=Korok
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#
# COMPUTE NODES
NodeName=node[01-20] CPUs=24 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=64000
PartitionName=Korok Nodes=node[01-20] Default=YES MaxTime=INFINITE State=UP

~ Joe G.

Do you run SELinux? I think there are some SELinux booleans you have to enable.

I do in fact run SELinux (though I must admit, with the number of things I shake my fist at regarding it, I have been tempted to disable it, given that we are behind the university firewall).

I went through the documentation for setting up OnDemand with SELinux. Let me know if there are specific configurations I need to search for.

~ Joe G.

I think you need to set this boolean, ondemand_use_slurm.

sudo setsebool -P ondemand_use_slurm=on

That seems to have done the trick.

However, upon restarting my PUN, the shell access, which worked fine before, now shows errors:

(screenshot of the broken shell app; image not preserved)

Here are the current sebooleans:

(base) [jpg00017@link ~]$ sudo getsebool -a | grep ondemand
ondemand_manage_user_home_dir --> off
ondemand_manage_vmblock --> off
ondemand_use_kerberos --> off
ondemand_use_kubernetes --> off
ondemand_use_ldap --> off
ondemand_use_nfs --> on
ondemand_use_shell_app --> off
ondemand_use_slurm --> on
ondemand_use_ssh --> on
ondemand_use_sssd --> on
ondemand_use_torque --> off

Setting ondemand_use_shell_app to on does not fix it.

~ Joe G.

That seems like it’d be the ticket, but even so, what you’re seeing looks like client-side errors. You say it worked before? That’s odd…

Could you share the errors from your browser, like the console logs and/or network errors?

It’s working now. Apparently, the CAS cert expired but didn’t kick me fully off. Thanks for the help! :smiley:

~ Joe G.
