Launch app on specific host skipping the scheduler

OOD version: 1.8.18.

When entered manually

ood-server.edu/node/linuxhost-internal-dns-name/[port number]
works just like:
ood-server.edu/node/compute-node-internal-dns/[port number]

It looks like my proxy regex in ood-portal.yml is fine, since manually entering the URL works.
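For anyone following along, here is a rough Python sketch of the kind of check the proxy performs on a /node/<host>/<port> URL. It is not the actual Apache config: `HOST_REGEX` is a hypothetical stand-in for whatever host_regex is set in ood-portal.yml, and `parse_node_url` is a made-up helper name.

```python
import re

# Hypothetical stand-in for the host_regex value in ood-portal.yml.
HOST_REGEX = r"[\w.-]+\.internaldns"

# URL shape described above: /node/<host>/<port>, optionally followed by a path.
NODE_URL = re.compile(
    rf"^/node/(?P<host>{HOST_REGEX})/(?P<port>\d+)(?P<rest>/.*)?$"
)


def parse_node_url(path):
    """Return (host, port) if the path is an allowed /node URL, else None."""
    match = NODE_URL.match(path)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))
```

If a manually entered URL for the Linux host passes a check like this, the regex is unlikely to be what marks the session as completed.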

The only thing that looks off in the generated files is that connection.yml has duplicate entries, i.e. host \n port \n password \n host \n port \n password, which is confusing as the script that creates that file does not use an appending redirect…

Let me confirm the behaviour you’re seeing:

  • you start the job
  • it immediately gets marked as completed
  • though it actually isn’t. You can manually input the host/port and connect to it.
  • and for whatever reason there are multiple host/port configs in the connection.yml

Is this right?

Actually, false alarm on the repeated configs in connection.yml: the actual file only has one copy, as it should. It looks like the file browser's view is rendering it twice for some reason.

Otherwise, yes that is exactly my experienced behavior.

Hey, sorry for forgetting about this for a bit.

Did you scrutinize the submit_host and ssh_hosts settings? ssh_hosts should list every host that submit_host can resolve to via DNS. It could be that you’re scheduling on a node that we are then unable to query.

ssh_hosts are all the hosts we’ll query, so it needs to be a complete list of what submit_host can DNS resolve to.
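A quick way to sanity-check this is sketched below in Python. It is only an illustration under assumptions: `resolve_names` and `missing_ssh_hosts` are hypothetical helper names, and using forward plus reverse DNS lookups is my guess at a reasonable way to enumerate what submit_host resolves to, not what the adapter itself does.

```python
import socket


def resolve_names(submit_host):
    """Collect the host names that submit_host resolves to (forward, then reverse DNS)."""
    _, _, addresses = socket.gethostbyname_ex(submit_host)
    names = set()
    for address in addresses:
        try:
            names.add(socket.gethostbyaddr(address)[0])
        except OSError:
            # No reverse record: fall back to the bare address.
            names.add(address)
    return names


def missing_ssh_hosts(resolved_names, ssh_hosts):
    """Return resolved hosts that are absent from ssh_hosts, sorted for stable output."""
    return sorted(set(resolved_names) - set(ssh_hosts))
```

Running `missing_ssh_hosts(resolve_names("target-fileserver.internaldns"), ["target-fileserver.internaldns"])` on the OOD host should give an empty list if ssh_hosts is complete.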

https://osc.github.io/ood-documentation/latest/installation/resource-manager/linuxhost.html#troubleshooting

Hi Jeff,

Yeah, I have played around a bunch with those bits.

Just for reference, I have one YAML file in clusters.d for my normal Slurm queue and another one for this Linux-host-specific target. That is kosher, right?

In my Linux host target clusters.d YAML I currently have submit_host set to the target fileserver, and that same fileserver listed in ssh_hosts. I have no round robin or anything, so it should only need the target host listed in the ssh_hosts section, right?

Here is an obfuscated portion of it:

v2:
  metadata:
    title: "Linux Host"
    hidden: true
  login:
    host: "target-fileserver.internaldns"
  job:
    adapter: "linux_host"
    submit_host: "target-fileserver.internaldns"
    ssh_hosts:
      - target-fileserver.internaldns
...

OK, you’re on 1.8. Do you happen to use tmux 2, or have that version of tmux in your PATH? I’m wondering if we’re using tmux 1 to start the job and tmux 2 to then query for it.

Oh that is a very good idea! I only have tmux 2.7 in the path (on both the ood host and the target).

That could be it. We fixed some bugs in 2.0, so it has better support for tmux 2.7 and 1.8 (I want to say 1.8 is the current 1.x version).

In fact, look in your /var/log/ondemand-nginx/$USER/error.log and you may be seeing failures around being able to query for those jobs.

Not seeing any errors, just the usual sesson.js checks every 10s.

Are the bugs in how OOD 1.8 interacts with tmux, or in using different tmux versions on the target and the host?

There are bugs in 1.8 with using tmux 2.7 anywhere (it doesn’t like the field separator we were using).

Looks like it was patched in ood_core 0.16.
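To illustrate why a separator problem makes a running job look completed, here is a minimal Python sketch. It is not the ood_core code: the three-field layout (name, created, pid), the `UNIT_SEPARATOR` value, and the way the separator gets mangled are all assumptions made up for the example.

```python
# Assumption: sessions are listed one per line, with fields joined by a delimiter.
UNIT_SEPARATOR = "\x1f"  # hypothetical non-printing field separator


def parse_sessions(output, sep=UNIT_SEPARATOR):
    """Parse 'name<sep>created<sep>pid' lines; skip lines that do not split into 3 fields."""
    sessions = []
    for line in output.split("\n"):
        fields = line.split(sep)
        if len(fields) != 3:
            continue  # unparseable line: this session is effectively invisible
        name, created, pid = fields
        sessions.append({"name": name, "created": int(created), "pid": int(pid)})
    return sessions


# Separator arrives intact: the session is found.
intact = "job-abc\x1f1600000000\x1f4242\n"

# Separator altered on the way out (illustrative only): no session is found,
# so a caller would conclude the job has already finished.
mangled = "job-abc_1600000000_4242\n"
```

With `intact`, `parse_sessions` returns one session. With `mangled` it returns an empty list even though the session is still alive, which matches the "immediately marked as completed" symptom.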

Ah ha, fantastic digging! Thanks Jeff.

For the time being, I am stuck on the newest 1.8, which does not ship a new enough ood_core version. Can I just make the edit myself?

Yup, I just got it working with the edit! Thanks Jeff!

I would suggest updating the docs to mention that, for OOD versions before 2.0, the linux_host adapter will not work with RHEL 8.