CAS integration: (was 401 Unauthorized) now is: Index of /

On the server where OOD is installed, yes, this directory exists:

ls -l /home/myuser/test_jobs/
total 0
[myuser@openondemand ~]$ ls -ld /home/myuser/test_jobs/
drwxr-xr-x 2 myuser domain users 6 Oct  7 11:18 /home/myuser/test_jobs/

Now that I've switched to the production server, I had to create that directory.

So there must be some confusion between where OOD is installed and the actual cluster login/submit node.

I get this error:

Job has status of completed

Output file from job does not exist: 

/home/myuser/test_jobs/output_ourcluster_2024_10_07t15_06_54_04_00_log
Test for 'ourcluster' FAILED!
Finished testing cluster 'ourcluster'

However the log file on production exists:
/home/myuser/test_jobs/output_ourcluster_2024_10_07t15_06_54_04_00_log

And its contents:
TEST A B C

Is there a misconfiguration?

The web node needs the same $HOME mount point that the cluster has. OnDemand uses the files in your $HOME to prep the job (on the web node side) and to read information back from the job.

For example, the job has to write out what host it's running on. It writes this to a file in your $HOME that OnDemand (on the web node) reads so it knows where to proxy requests to.
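
One quick way to confirm both sides see the same $HOME is to compare the mounts; a rough sketch (hostnames are just this thread's placeholders):

# On the OOD web node: note which filesystem backs $HOME
df -hT "$HOME"
ls -ld "$HOME/test_jobs"

# On the submit host, the same check should show the same shared export
ssh ourcluster.ouruni.edu 'df -hT "$HOME"; ls -ld "$HOME/test_jobs"'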

Got it, OK. I mounted /home. What would cause this error?

The cluster config for ourcluster has a problem: (<unknown>): did not find expected key while parsing a block mapping at line 2 column 1

Edit: now it’s:
The cluster config for **ourcluster** has a problem: (<unknown>): did not find expected key while parsing a block mapping at line 8 column 6

 1 ---
 2 v2:
 3    metadata:
 4	title: "Ourcluster"
 5    login:
 6	host: "ourcluster.ouruni.edu"
 7    job:
 8	adapter: "slurm"
 9	submit_host: "ourcluster.ouruni.edu"
10	ssh_hosts:
11        - ourcluster.ouruni.edu
12	cluster: Ourcluster

Is there an indentation problem?

Yes, it should have this format. I pulled this directly from the documentation.

---
v2:
   metadata:
     title: "My Cluster"
   login:
     host: "my_cluster.my_center.edu"
   job:
     adapter: "slurm"
     cluster: "Ourcluster"
     conf: "/path/to/slurm.conf"
     submit_hosts:
       - ourcluster.ouruni.edu
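
If you want to sanity-check the file outside of OnDemand, Ruby's own YAML parser reports the same "line X column Y" errors; a sketch, assuming the default clusters.d path:

# Psych raises the same "did not find expected key ... at line X column Y" message on bad indentation
ruby -ryaml -e 'YAML.load_file("/etc/ood/config/clusters.d/ourcluster.yml")' && echo "parsed OK"

# YAML forbids tabs for indentation; this flags any lines that start with one
grep -nP '^\t' /etc/ood/config/clusters.d/ourcluster.yml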

I used a JSON tester and it validated. What could be causing the error?

The cluster config for **ourcluster** has a problem: (<unknown>): did not find expected key while parsing a block mapping at line 8 column 6

---
v2:
   metadata:
     title: "Axon"
   login:
     host: "ourcluster.ouruni.edu"
   job:
     cluster: "Ourcluster"
     adapter: "slurm"
     conf: "/etc/slurm/slurm.conf"
     submit_hosts:
       - ourcluster.ouruni.edu
     bin: "/sbin" 
     bin_overrides:
       sbatch: "/usr/bin/sbatch"
       squeue: "/usr/bin/squeue"
       scontrol: "/usr/bin/scontrol"
       scancel: "/usr/bin/scancel"
     strict_host_checking: false
     copy_environment: false

Line 8 is the cluster: "Ourcluster" line

Maybe you need to restart your webserver (in the help menu) to pick up the new/valid configs?

Indeed that got me past the error. I thought systemctl restart httpd would take care of that.
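
Right, systemctl restart httpd only restarts Apache; the cluster YAMLs are read by the apps running under your Per-User NGINX (PUN), which is what the Help menu's Restart Web Server item restarts. From a shell, something along these lines (a sketch; the nginx_stage path and subcommands can differ by version/install):

# Show whether your PUN is still running with the old config loaded
pgrep -u "$USER" -af nginx

# Clean up idle per-user NGINX instances so they come back with fresh configs
sudo /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean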

How can I troubleshoot why jobs are not displaying ("No data available in table")?

sudo su myuser -c 'scl enable ondemand -- bin/rake test:jobs:ourcluster RAILS_ENV=production'
Testing cluster 'ourcluster'...
Submitting job...
rake aborted!
Errno::ENOENT: No such file or directory - /usr/bin/sbatch

It’s not seeing sbatch on the login/submit node?
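
A quick check is to compare what the web node finds locally with what the submit host sees over SSH (a sketch; adjust the hostname):

# On the web node: is there a local sbatch at all?
command -v sbatch || echo "no sbatch on the web node"

# On the submit host, via the same SSH path OnDemand would use:
ssh ourcluster.ouruni.edu 'command -v sbatch && sbatch --version'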

You can see the exact commands we issue in /var/log/ondemand-nginx/$USER/error.log (grep for execve or squeue or similar).

You can issue these same commands to replicate. Also note that in Active Jobs you may have a filter turned on, like "only show my jobs" or "only show jobs on cluster X", in which case you don't actually have any jobs to show.
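
To pull the commands out of the log, something like this (a sketch, assuming the default per-user log location):

# Each command OnDemand runs shows up as an execve(...) line in your PUN error log
grep -E 'execve|sbatch|squeue' "/var/log/ondemand-nginx/$USER/error.log" | tail -n 20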

OK why is it not finding these commands on the actual submit node?

App 829031 output: [2024-10-08 10:05:58 -0400 ] WARN "Error opening MOTD at \nException: bad URI(is not URI?): nil"

App 829031 output: [2024-10-08 10:06:10 -0400 ] ERROR "Errno::ENOENT: No such file or directory - /usr/bin/squeue\n/usr/share/ruby/open3.rb:222:in spawn'\n/usr/share/ruby/open3.rb:222:in popen_run'\n/usr/share/ruby/open3.rb:103:in popen3'\n/usr/share/ruby/open3.rb:290:in

Edit: now seeing this error:

Testing cluster 'ourcluster'...
Submitting job...
rake aborted!
OodCore::JobAdapterError: hostname contains invalid characters

What characters are invalid?

According to the docs here, the option is submit_host, but your example has submit_hosts. Which is correct?

Sorry, the docs are always right. I copied it and started to restructure it to look more like yours.
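
With the documented singular key, the config ends up roughly like the sketch below; it mirrors the file from earlier in the thread but swaps the submit_hosts list for a single submit_host string (written as a heredoc so the whole file is in one place, and all paths/hostnames are just this thread's examples):

sudo tee /etc/ood/config/clusters.d/ourcluster.yml > /dev/null <<'EOF'
---
v2:
  metadata:
    title: "Ourcluster"
  login:
    host: "ourcluster.ouruni.edu"
  job:
    adapter: "slurm"
    cluster: "Ourcluster"
    conf: "/etc/slurm/slurm.conf"
    # submit_host takes a single hostname string, not a list
    submit_host: "ourcluster.ouruni.edu"
    bin_overrides:
      sbatch: "/usr/bin/sbatch"
      squeue: "/usr/bin/squeue"
      scontrol: "/usr/bin/scontrol"
      scancel: "/usr/bin/scancel"
    strict_host_checking: false
    copy_environment: false
EOF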

Progress!

Got job id '3764893'
Job has status of queued
Job has status of completed
Test for 'ourcluster' PASSED!
Finished testing cluster 'ourcluster'

For this to work, however, I had to create an SSH key, e.g., ssh-keygen and then ssh-copy-id -i ~/.ssh... (a sketch follows the error below).
Otherwise I get this error:

OodCore::JobAdapterError: Warning: Permanently added 'ourcluster.ouruni.edu' (ED25519) to the list of known hosts.
myuser@ourcluster.ouruni.edu: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
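
For reference, the per-user workaround amounts to something like this (a sketch; key type and paths are only examples):

# On the web node: generate a key and authorize it on the submit host
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ''
ssh-copy-id -i ~/.ssh/id_ed25519.pub myuser@ourcluster.ouruni.edu

# Verify the non-interactive SSH that OnDemand relies on now works
ssh -o BatchMode=yes ourcluster.ouruni.edu hostname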

Is there a better way to handle this so we don't have to tell all users to do the same?

Yes, sshd supports HostbasedAuthentication, in which the servers themselves have key pairs, not any given user.
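
In OpenSSH terms that looks roughly like the following; a sketch of plain host-based auth, not an OnDemand-specific recipe, with openondemand.ouruni.edu standing in for the web node:

# On the submit host (ourcluster.ouruni.edu), enable it in /etc/ssh/sshd_config:
#   HostbasedAuthentication yes
# then trust the web node by name and by host key:
echo "openondemand.ouruni.edu" | sudo tee -a /etc/ssh/shosts.equiv
ssh-keyscan openondemand.ouruni.edu | sudo tee -a /etc/ssh/ssh_known_hosts
sudo systemctl reload sshd

# On the web node, in /etc/ssh/ssh_config (or a drop-in):
#   HostbasedAuthentication yes
#   EnableSSHKeysign yes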

So something like this. I’d be curious to see how others have done this so I’ll search around. We do use sssd but I’m not sure there’s a way to use that for this?

Edit: I see a thread about using munge. Is this still an option? Or this wrapper for version 1.5?

jeff.ohrstrom, do you know if this wrapper will still work in OOD 3.1?

Yes, copy_environment and job_environment work. At OSC we have the Slurm binaries on the web node itself. But we also use HostbasedAuthentication so folks can ssh here and there easily.