Job composer and desktop not working

Hi,

I have a close-to-working OnDemand dev environment set up, but I'm having issues getting GUI job submissions and GUI desktop environment jobs to work. They both fail with an error about failing to contact the Slurm controller, or about the cluster not being set.

I successfully tested sending jobs via the terminal with the test workflow here:

https://osc.github.io/ood-documentation/latest/installation/resource-manager/test.html

I confirmed it worked by checking the slurmctld logs on the controller node, which showed the job ran.
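For reference, the manual test was roughly along these lines (the script contents here are a sketch, not the exact file I used):

# run as a regular user on the OnDemand web node, against the same slurm.conf
export SLURM_CONF=/etc/slurm/slurm.conf
cat > test_job.sh <<'EOF'
#!/bin/bash
#SBATCH -J ood-cli-test
hostname
EOF
/usr/bin/sbatch test_job.sh   # returned a job id, and slurmctld logged the job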

Is there some extra step I am missing to get the job composer to work? Some other log I should be looking at?

Thanks,

Miles

v2:
  metadata:
    title: "plan9"
  login:
    host: "exadev2.ohsu.edu"
  job:
    adapter: "slurm"
    bin: "/usr/bin/"
    conf: "/etc/slurm/slurm.conf"

App 5736 output: [2023-09-12 17:12:07 -0400 ] DEBUG "[562dbe76-2c56-495c-b522-277c12f06bfb]   \e[1m\e[36mWorkflow Load (2.4ms)\e[0m  \e[1m\e[34mSELECT \"workflows\".* FROM \"workflows\" WHERE \"workflows\".\"id\" = ? LIMIT ?\e[0m  [[\"id\", 1], [\"LIMIT\", 1]]"
App 5736 output: [2023-09-12 17:12:07 -0400 ] DEBUG "[562dbe76-2c56-495c-b522-277c12f06bfb]   \e[1m\e[36mJob Load (1.8ms)\e[0m  \e[1m\e[34mSELECT \"jobs\".* FROM \"jobs\" WHERE \"jobs\".\"workflow_id\" = ?\e[0m  [[\"workflow_id\", 1]]"
App 5736 output: [2023-09-12 17:12:07 -0400 ]  INFO "[562dbe76-2c56-495c-b522-277c12f06bfb] execve = [{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/sbatch\", \"-A\", \"acc\", \"--export\", \"NONE\", \"--parsable\"]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] ERROR "[562dbe76-2c56-495c-b522-277c12f06bfb] An error occurred when submitting jobs for simulation 1: sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)"
App 5736 output: [2023-09-12 17:12:16 -0400 ]  INFO "[562dbe76-2c56-495c-b522-277c12f06bfb] method=PUT path=/pun/sys/myjobs/workflows/1/submit format=html controller=WorkflowsController action=submit status=302 duration=9014.72 view=0.00 db=4.21 location=https://openondemanddev.ohsu.edu/pun/sys/myjobs/workflows"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[12680650-8366-4e51-b025-bc43c7a2bbc5]   \e[1m\e[36mWorkflow Load (2.0ms)\e[0m  \e[1m\e[34mSELECT \"workflows\".* FROM \"workflows\" INNER JOIN \"jobs\" ON \"jobs\".\"workflow_id\" = \"workflows\".\"id\" WHERE \"jobs\".\"status\" IN (?, ?, ?, ?)\e[0m  [[\"status\", \"H\"], [\"status\", \"Q\"], [\"status\", \"R\"], [\"status\", \"S\"]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[12680650-8366-4e51-b025-bc43c7a2bbc5]   \e[1m\e[35mSQL (1.8ms)\e[0m  \e[1m\e[34mSELECT \"workflows\".\"id\" AS t0_r0, \"workflows\".\"created_at\" AS t0_r1, \"workflows\".\"updated_at\" AS t0_r2, \"workflows\".\"job_attrs\" AS t0_r3, \"workflows\".\"name\" AS t0_r4, \"workflows\".\"batch_host\" AS t0_r5, \"workflows\".\"staged_dir\" AS t0_r6, \"workflows\".\"script_name\" AS t0_r7, \"jobs\".\"id\" AS t1_r0, \"jobs\".\"workflow_id\" AS t1_r1, \"jobs\".\"status\" AS t1_r2, \"jobs\".\"job_cache\" AS t1_r3, \"jobs\".\"created_at\" AS t1_r4, \"jobs\".\"updated_at\" AS t1_r5 FROM \"workflows\" LEFT OUTER JOIN \"jobs\" ON \"jobs\".\"workflow_id\" = \"workflows\".\"id\"\e[0m"
App 5736 output: [2023-09-12 17:12:16 -0400 ]  INFO "[12680650-8366-4e51-b025-bc43c7a2bbc5] method=GET path=/pun/sys/myjobs/workflows format=html controller=WorkflowsController action=index status=200 duration=8.11 view=2.43 db=3.83"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[42c31484-8679-4c27-92ad-d8cd2fbed03b]   \e[1m\e[36mWorkflow Load (2.0ms)\e[0m  \e[1m\e[34mSELECT \"workflows\".* FROM \"workflows\" INNER JOIN \"jobs\" ON \"jobs\".\"workflow_id\" = \"workflows\".\"id\" WHERE \"jobs\".\"status\" IN (?, ?, ?, ?)\e[0m  [[\"status\", \"H\"], [\"status\", \"Q\"], [\"status\", \"R\"], [\"status\", \"S\"]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[42c31484-8679-4c27-92ad-d8cd2fbed03b]   \e[1m\e[36mWorkflow Load (1.7ms)\e[0m  \e[1m\e[34mSELECT \"workflows\".* FROM \"workflows\" WHERE \"workflows\".\"id\" = ? LIMIT ?\e[0m  [[\"id\", 1], [\"LIMIT\", 1]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[42c31484-8679-4c27-92ad-d8cd2fbed03b]   \e[1m\e[36mJob Load (1.7ms)\e[0m  \e[1m\e[34mSELECT \"jobs\".* FROM \"jobs\" WHERE \"jobs\".\"workflow_id\" = ?\e[0m  [[\"workflow_id\", 1]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[42c31484-8679-4c27-92ad-d8cd2fbed03b]   \e[1m\e[36mCACHE Workflow Load (0.0ms)\e[0m  \e[1m\e[34mSELECT \"workflows\".* FROM \"workflows\" WHERE \"workflows\".\"id\" = ? LIMIT ?\e[0m  [[\"id\", 1], [\"LIMIT\", 1]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[42c31484-8679-4c27-92ad-d8cd2fbed03b]   \e[1m\e[36mJob Load (1.6ms)\e[0m  \e[1m\e[34mSELECT \"jobs\".* FROM \"jobs\" WHERE \"jobs\".\"workflow_id\" = ? ORDER BY \"jobs\".\"id\" DESC LIMIT ?\e[0m  [[\"workflow_id\", 1], [\"LIMIT\", 1]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[42c31484-8679-4c27-92ad-d8cd2fbed03b]   \e[1m\e[36mCACHE Job Load (0.0ms)\e[0m  \e[1m\e[34mSELECT \"jobs\".* FROM \"jobs\" WHERE \"jobs\".\"workflow_id\" = ? ORDER BY \"jobs\".\"id\" DESC LIMIT ?\e[0m  [[\"workflow_id\", 1], [\"LIMIT\", 1]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ] DEBUG "[42c31484-8679-4c27-92ad-d8cd2fbed03b]   \e[1m\e[36mCACHE Job Load (0.0ms)\e[0m  \e[1m\e[34mSELECT \"jobs\".* FROM \"jobs\" WHERE \"jobs\".\"workflow_id\" = ?\e[0m  [[\"workflow_id\", 1]]"
App 5736 output: [2023-09-12 17:12:16 -0400 ]  INFO "[42c31484-8679-4c27-92ad-d8cd2fbed03b] method=GET path=/pun/sys/myjobs/workflows/1.json format=json controller=WorkflowsController action=show status=200 duration=12.29 view=2.73 db=6.99"

Hi, thanks for the post!

The test is a bit of a beta feature still in the works, so I apologize for that. Looking under job, I see there's no host set, which you need for jobs to submit; that might be the issue with the Job Composer submission.

The documentation here might explain this better, but basically the login host you set is intended for the Shell app, not the actual compute cluster. The example there may make this clearer: note the two hosts being set, one under login and the other under job, with slightly different names to convey the distinction:

...
login:
  host: "owens.osc.edu"
job:
  ...
  host: "owens-batch.ten.osc.edu"
...

https://osc.github.io/ood-documentation/latest/installation/cluster-config-schema.html#first-an-example

Hopefully this page helps clear things up some more but please feel free to ask if anything is unclear.

Hi Travis,

Thanks for the information.

v2:
  metadata:
    title: "plan9"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    cluster: "plan9"

I tried a few iterations of this and it's still giving the same error.

Make sure to add the host key under the job block. It's still missing, so OOD doesn't know which cluster to submit the jobs to.

v2:
  metadata:
    title: "plan9"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    host: "exadev2.ohsu.edu"

Hi Travis,

I tried that again today, and it's still not working, with the same error. I came across an old thread describing a similar issue:

https://lists.openhpc.community/g/users/topic/slurm_compute_node_unable_to/73173441

But it appears the Slurm configuration option mentioned there was deprecated in favor of SlurmctldHost, which we do have configured in our slurm.conf file.

We ran some trace tests and found that no TCP connections are started at all when the Submit button is clicked.
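The trace was roughly of this form (assuming the default slurmctld port 6817); it showed no packets when Submit was clicked:

# on the OnDemand web node, while clicking Submit in the Job Composer
tcpdump -i any -nn tcp port 6817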

Let's cover a few things and make sure the configuration files are correct before we start looking at Slurm bugs.

The newest file is looking better. Now, are the Slurm binaries also installed on the web node running OOD, as well as on the compute cluster, so that the Slurm commands can actually run?
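As a quick sanity check, something like this run as a regular user on the web node should succeed before the Job Composer can (paths taken from your cluster config, otherwise just a sketch):

# on the OnDemand web node, as a regular user
/usr/bin/sbatch --version                        # client binary is installed
SLURM_CONF=/etc/slurm/slurm.conf /usr/bin/sinfo  # and it can reach slurmctld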

Hi Travis,

Good news: it looks like it was an SELinux issue:

type=SYSCALL msg=audit(1694638117.865:790): arch=c000003e syscall=42 success=no exit=-13 a0=3 a1=20c90d0 a2=80 a3=7fffe6983820 items=0 ppid=39754 pid=39793 auid=4294967295 uid=5624 gid=3010 euid=5624 suid=5624 fsuid=5624 egid=3010 sgid=3010 fsgid=3010 tty=(none) ses=4294967295 comm="sbatch" exe="/usr/bin/sbatch" subj=system_u:system_r:ood_pun_t:s0 key=(null)ARCH=x86_64 SYSCALL=connect AUID="unset" UID="perrymil" GID="HPCUsers" EUID="perrymil" SUID="perrymil" FSUID="perrymil" EGID="HPCUsers" SGID="HPCUsers" FSGID="HPCUsers"
type=AVC msg=audit(1694638118.865:791): avc:  denied  { name_connect } for  pid=39793 comm="sbatch" dest=6817 scontext=system_u:system_r:ood_pun_t:s0 tcontext=system_u:object_r:unreserved_port_t:s0 tclass=tcp_socket permissive=0

We successfully submitted a job from the Job Composer after setting SELinux to permissive. It looks like SELinux wasn't sure what to do about OOD running sbatch? I installed the OOD SELinux package and was under the impression it would handle things like this, but I guess it didn't?
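(For reference, the audit entries above came from the web node's audit log; something like this pulls them out:)

# show recent AVC denials for sbatch
ausearch -m avc -c sbatch --start recent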

SELinux: the most secure way to break things and then lead you to make changes that are possibly not secure. I don't value SELinux highly for this reason, but I also understand that policy may require you to enforce it.

Glad it's working now that you have it in permissive mode, but if it's in permissive mode, why use it at all? Or was this just to see whether SELinux was the issue?

Did you look at the SELinux page and set the flags correctly for Slurm?
https://osc.github.io/ood-documentation/latest/installation/modify-system-security.html

Note that by default ondemand_use_slurm is set to false, which might be the first issue. Hopefully that page gets you back to a working state with SELinux, though.
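Roughly, flipping the boolean persistently and going back to enforcing looks like this (boolean name as on that page):

# on the OnDemand web node
getsebool ondemand_use_slurm        # likely shows "off" right now
setsebool -P ondemand_use_slurm=on  # persist across reboots
setenforce 1                        # back to enforcing for the running system
# (also revert SELINUX= in /etc/selinux/config if it was switched to permissive there)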
