Launch app on specific host skipping the scheduler

mjbludwig · May 6, 2021, 4:21pm

Hi All,

I have a cluster with various fileservers that are sometimes used for compute instead of the general compute nodes in Slurm, and I would like to be able to launch Jupyter notebooks on them. I suppose I could put them in Slurm, but given the “linux_host” adapter, I was wondering if there is a way to do it without adding these fileservers to Slurm?

jeff.ohrstrom · May 6, 2021, 6:23pm

Yeah, I think the linux host adapter may suit this use case, because it’s meant for login nodes or at least nodes that are not a part of the general compute infrastructure.

It does depend on a few things like tmux and singularity on the destination server, so just keep that in mind.

mjbludwig · May 10, 2021, 12:52pm

@jeff.ohrstrom The “cluster” setting in an app’s form.yml is immutable by the form page right? i.e. I cannot dynamically change its value with a form field/javascript?

jeff.ohrstrom · May 10, 2021, 4:33pm

No, you have a couple of options here in 1.8+.

Adding additional clusters to the cluster attribute will give you a dropdown, which you can interact with through JavaScript.

https://osc.github.io/ood-documentation/latest/app-development/interactive/form.html#configuring-which-cluster-to-submit-to
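
As a rough sketch of what that looks like in form.yml: the cluster names below are placeholders, and each one is assumed to match a cluster config file name under clusters.d on your system. The rest of the form is left out.

```yaml
# form.yml (sketch; "slurm_cluster" and "fileserver01" are hypothetical names)
---
cluster:
  - "slurm_cluster"   # e.g. clusters.d/slurm_cluster.yml, a Slurm cluster
  - "fileserver01"    # e.g. clusters.d/fileserver01.yml, a linux_host cluster
```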

There’s a little complexity here, though, given that different adapters require different fields. That is, if you were to set, say, a Slurm cluster A and a linux host adapter cluster B, they’d want very different things in the native field of the submit.yml.erb. You can use the file below as a reference on how to toggle these. At one point we had a Slurm cluster and a Torque cluster at the same time, so we accessed this information through OodAppkit.clusters[cluster].job_config[:adapter] and submitted different native args based on that.

OSC/osc-ood-config/blob/v76/ondemand.osc.edu/apps/bc_desktop/submit/vdi-submit.yml.erb

<%-
  quick_cluster_lookup = {
    "quick"         => "owens",
    "quick_pitzer"  => "pitzer",
  }

  torque_cluster = OodAppkit.clusters[cluster].job_config[:adapter] == 'torque'
  torque_args = "#{bc_num_slots}#{node_type}:#{quick_cluster_lookup[cluster]}"
  slurm_args = [ "--partition", "quick", "--nodes", "1", "--ntasks-per-node", "1" ]
-%>
---
script:
  native:
    <%- if torque_cluster -%>
    resources:
      nodes: "<%= torque_args %>"
    <%- else -%>
    <%- slurm_args.each do |arg| -%>
    - "<%= arg %>"
    <%- end -%>

This file has been truncated.

mjbludwig · May 11, 2021, 8:37pm

I think this is exactly what I needed. The key was that support for multiple clusters in the cluster attribute of form.yml is already built in!

Thanks for the submit.yml.erb snippet as well. That is how I will pass the native bits when submitting to Slurm and nothing when sending to a linux_host adapter!
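
A minimal sketch of that idea, reusing the adapter lookup from the OSC template above. It assumes the app uses the "basic" batch connect template (as Jupyter apps commonly do) and that the linux_host cluster needs no native arguments; the Slurm arguments are only illustrative.

```yaml
<%-
  # Same lookup as in the OSC example: which adapter does the chosen cluster use?
  slurm_cluster = OodAppkit.clusters[cluster].job_config[:adapter] == 'slurm'
  # Illustrative arguments only; replace with whatever your Slurm site needs.
  slurm_args = [ "--nodes", "1", "--ntasks-per-node", "1" ]
-%>
---
batch_connect:
  template: "basic"
<%- if slurm_cluster -%>
script:
  native:
    <%- slurm_args.each do |arg| -%>
    - "<%= arg %>"
    <%- end -%>
<%- end -%>
```

When the selected cluster is not Slurm, the whole script section is skipped, so nothing Slurm-specific reaches the linux_host adapter.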

mjbludwig · May 19, 2021, 6:59pm

@jeff.ohrstrom I am getting pretty far with this but am now stuck. The apps launch correctly (I can see jupyter running as my user on the target host), and I can manually enter the URL, so the reverse proxy works, but the apps in the portal go right to completed with no connect button.

I am guessing this is because the connector is failing to communicate with the process, but I cannot figure out what is blocking it. Any ideas?

jeff.ohrstrom · May 19, 2021, 8:09pm

I’ve noticed similar behaviour, where ‘it just exits immediately’, and there is a troubleshooting section for it, linked below. There are steps to debug, but as an off-the-top guess, I’d scrutinize the submit_host and the ssh_hosts. ssh_hosts should be any host the submit_host can DNS resolve to.

https://osc.github.io/ood-documentation/latest/installation/resource-manager/linuxhost.html#troubleshooting
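
For reference, a sketch of where those two settings live in a linux_host cluster config. The file name and host names are hypothetical, and the remaining adapter keys are left out.

```yaml
# /etc/ood/config/clusters.d/fileserver01.yml (hypothetical file and hosts)
---
v2:
  metadata:
    title: "Fileserver 01"
  login:
    host: "fileserver01.example.edu"
  job:
    adapter: "linux_host"
    # The host that sessions are submitted to over SSH.
    submit_host: "fileserver01.example.edu"
    # Every real host name that submit_host can resolve to.
    ssh_hosts:
      - "fileserver01.example.edu"
    # Other linux_host keys (tmux_bin, singularity_bin, ...) omitted here.
```

With a single fileserver the two values are typically the same name. They differ when submit_host is a round-robin alias in front of several hosts.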

mjbludwig · June 16, 2021, 6:09pm

Hi @jeff.ohrstrom

Do I need to run an app intended for a linux host in a specific container, as the wiki states, by adding a singularity_container: /usr/local/modules/netbeans/netbeans_2019.sif line to the native override in the app’s submit.yml.erb?

jeff.ohrstrom · June 16, 2021, 7:46pm

No, I think we run a base centos:7 image and just mount in everything we need. We really only use it for process management more than anything else.

So you could either use a basic image and mount in what you need (like we do for code-server) or you can have a specific image that holds what you need and have fewer mount ins. Totally up to you.

mjbludwig · June 17, 2021, 1:33pm

Ok cool.

Still trying to figure out why these jobs are going straight to completed even though the singularity container and its internal process are running fine. I can even manually enter the uri to redirect to the jupyterlab that is running on the node.

I can’t quite find where in the code the state is being determined. I see that status.rb in ood_core (master branch of OSC/ood_core on GitHub) handles state info for other parts to query, but I don’t see the logic that does the actual test.

Do you happen to know off the top of your head what is being tested/queried on the node to determine state? I’m guessing it’s checking the PID of the singularity command?

Thanks!

mjbludwig · September 30, 2021, 7:44pm

@jeff.ohrstrom I just noticed that the tmp.XXXXX_tmux script is being generated with timeout 0s ....., which I am guessing is why the job is just completing instantly. Do you know where this is being set? I notice that in the documentation (Configure LinuxHost Adapter (beta), Open OnDemand 1.8.12 documentation) the example has a very large timeout set.

It looks like that timeout gets populated by the “site_timeout” setting in the cluster config yml, based on launcher.rb in ood_core (OSC/ood_core on GitHub, commit fc8c05badb329817a04437f1736f09d1519a239d).

Any idea what might be hard coding this to “0s”?

jeff.ohrstrom · September 30, 2021, 7:54pm

Looks like site_timeout is defaulting to 0 which is wrong. I’ve filed a bug for the same. You should set site_timeout to something (7200 in the example config is 2 hours).
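
As a sketch, the relevant fragment of the cluster config would look something like this (only the keys under discussion are shown):

```yaml
# fragment of the linux_host cluster config under clusters.d
v2:
  job:
    adapter: "linux_host"
    site_timeout: 7200   # seconds, i.e. 2 hours
```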

mjbludwig · September 30, 2021, 8:04pm

Yeah, that’s what I have it set to, just like the example. I wonder if something in my config is causing it to ignore that variable in the yml.

jeff.ohrstrom · September 30, 2021, 8:16pm

OK - there could be a second bug here. Are you submitting the job with any walltime? You may need to do that.

mjbludwig · September 30, 2021, 8:22pm

I have not set anything for a wall time so no. Should I just set a “walltime: x” in the script → native section of the submit.yml?

jeff.ohrstrom · September 30, 2021, 8:44pm

You should be able to use bc_num_hours.

Or if you have walltime in the form, you can use it like this:

script:
  wall_time: <%= walltime %>
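
If the form only has bc_num_hours, a sketch of the same idea follows. It assumes bc_num_hours holds a whole number of hours and that wall_time is expected in seconds.

```yaml
---
script:
  # Assumption: bc_num_hours is a whole number of hours from the form.
  wall_time: <%= bc_num_hours.to_i * 3600 %>
```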

mjbludwig · September 30, 2021, 9:25pm

Can confirm this fixes the 0s timeout. Now onto debugging why tmux is crashing instantly with no output haha

mjbludwig · September 30, 2021, 9:41pm

Hmm, I might have spoken too soon. I had an issue in my config that kept singularity from starting. The timeout is still fixed, but the app still goes right to completed even though everything looks to start and run on the target host.

mjbludwig · September 30, 2021, 10:14pm

Ok, well, at this point the app goes to completed immediately, but if I manually change the URL to the proxy and port on the target host, I can get there…

jeff.ohrstrom · October 1, 2021, 1:38pm

What version of OnDemand are you on, and what are the wrong and right URLs?
