Are there plans to add the capability for users to run ‘desktop+apps’ directly on the ondemand host in addition to compute hosts? We plan to purchase a new physical machine to run ondemand and this feature would be very useful for simple pre/post job analysis rather than dedicating valuable compute resources.
See OnDemand 2.x Roadmap: Scalability Goals. The first goal under “Extendability” is “Interactive work without a batch scheduler”. One of the adapters is “SSH/Fork” which is actually two different adapters: 1. Fork and 2. SSH+Fork. The first of course is running interactive apps on the OnDemand host.
Your login node could be part of the scheduler, and you could use oversubscription if you’d prefer to have this always available (within reason).
In OnDemand 1.7 we will have a beta “Linux Host Adapter”. Documentation is in progress https://osc.github.io/ood-documentation/release-1.7/installation/resource-manager/linuxhost.html. It currently uses ssh+tmux+singularity to connect to any host (as configured in a cluster config file) and start processes. The recipe for process management will be improved (or other options added) in the future. With the right setup, this could be used to start interactive apps on the OnDemand host or on a login node.
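As a rough illustration of what such a cluster config file might look like, here is a minimal sketch based on the keys described in the linked documentation. The file name, hostnames, bind paths, and image path are all placeholders made up for this example, not values from any real site, and the adapter is still beta, so the exact keys may change.

```yaml
# /etc/ood/config/clusters.d/linuxhost.yml  (hypothetical file name and values)
---
v2:
  metadata:
    title: "Login Node"            # placeholder display name
    hidden: true
  login:
    host: "login.example.edu"      # placeholder hostname
  job:
    adapter: "linux_host"
    submit_host: "login.example.edu"   # host that OnDemand connects to over ssh
    ssh_hosts:                         # actual hosts behind a round robin, if any
      - login01.example.edu
      - login02.example.edu
    site_timeout: 7200                 # seconds before a session is killed
    debug: true
    singularity_bin: /usr/bin/singularity
    # Bind in whatever the apps need from the host (X11 libraries, .so files, home dirs)
    singularity_bindpath: /etc,/opt,/usr,/var,/home
    singularity_image: /opt/ood/linuxhost_adapter/centos_7.6.sif   # placeholder path
    strict_host_checking: false
    tmux_bin: /usr/bin/tmux
```

To target the OnDemand host itself rather than a login node, the same sketch would presumably point `submit_host` and `ssh_hosts` at that machine, provided users can ssh to it.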
At OSC we do something similar with a “quick batch” cluster: a queue of nodes drawn from all 3 of our clusters, which are oversubscribed. Jobs typically request 1 node and 1 proc per node, and because the nodes are shared, users get almost immediate access.
Our initial goal is to use the login nodes for interactive desktop jobs without the need to submit a job and wait. For pre and post processing, our users do this already on our login nodes.
We just want to make it also available with ood.
We are considering the login node cluster solution mentioned above.
Do you have an example of the oversubscribed login ‘cluster’ ?
Or do I need to change the desktop app? I am willing to write a patch.
@baverhey Hi and welcome! Are you looking for an example application? Or how the login system is configured (in terms of cgroups and so on)? Or how OnDemand is configured to use the login node?
How to configure OOD to use the adapter (my 3rd question) is covered in the documentation link Eric gave above. As for my first question, an example application: one is tensorboard, below. Another, which I just tried while preparing this reply, was a full desktop. I took one of our desktop configurations and literally just changed the cluster attribute to linuxhost, and it worked directly out of the box.
Here’s tensorboard, but you’ll notice that the only real difference is the ‘cluster’ name in the form.yml. It’s not significantly different from any other app. (All the strange work put into that app around unshare and so on is because tensorboard doesn’t have authentication, not because of this adapter.)
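To make the ‘cluster’ point concrete, here is a minimal sketch of the relevant part of an app’s form.yml. The cluster name assumes a cluster config file named linuxhost.yml exists in clusters.d, and the single form field is only there so the fragment is complete; a real app would keep whatever attributes it already has.

```yaml
# form.yml (illustrative fragment, not the actual tensorboard or desktop app)
---
# The only change from a batch-scheduled app: point at the linux_host cluster.
# "linuxhost" is assumed to match /etc/ood/config/clusters.d/linuxhost.yml.
cluster: "linuxhost"
form:
  - bc_num_hours
```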
To be sure, it likely worked easily for me because of these configs, which are about ensuring the correct libraries are available, and because I had already put work (headache/heartache) into ensuring the adapter worked for another app.
# linuxhost.yml. Many items removed for brevity.
# we're mounting a lot of things. This means X11 libraries, and all those
# .so files we need.
# I was able to debug my jobs and resubmit them if I needed
# we have a module for VNC, so that's super convenient for VNC
# applications and PATH management.
module load ondemand-vnc
I can get it to work on all our normal clusters without the singularity.
module load TurboVNC
But not using the linux_host adapter. We are on the latest 1.6 version, 1.6.20 (upgrading to OOD 1.7 breaks OOD).
#<LoadError: Could not load ‘linux_host’. Make sure that the job adapter in the configuration file is valid.>
I looked in the repo and found the PR in ood-core
Are there known issues when upgrading from 1.6 to 1.7, or an upgrade path? I’m willing to give 1.7 another go.
Not that I’m aware of. Though when you upgrade, you’ll have to be sure to bounce your own web server (the per-user NGINX, PUN) using the option at the top right under Help.
To be clear, you can only have files in /etc/ood/config/clusters.d that are valid for the version you have loaded. You’ll get that error when you have a linux_host configuration and you’re actively running an older version; linux_host configurations are only valid in 1.7. Note the word actively there: you could install 1.7 and potentially still be running the previous version because it was still in memory.
As an aside, I think you’re running into something that we may not advertise well: the need to roll through everyone’s PUNs and bounce them when required (as in this case).