LSF multi-cluster environment deleting panel

rotaugenlaubfrosch · June 5, 2020, 4:14pm

Hi All,
Occasionally, users see the “starting” session panel disappear after about 1 minute, and the “connect to application” panel never shows up.

Starting session panel:

After checking the logs, it looks like the job was submitted successfully and ran until the queue run time limit was reached.

output.log

[...]
Started at Thu Jun  4 12:41:43 2020
Terminated at Thu Jun  4 13:41:43 2020
Results reported at Thu Jun  4 13:41:43 2020
[...]
Script starting...
Waiting for RStudio Server to open port 35714...on host server
ESC[34mINFO:   ESC[0m Converting OCI blobs to SIF format
ESC[34mINFO:   ESC[0m Starting build...
Getting image source signatures
Copying blob sha256:d51af753c3d3a984351448ec0f85ddafc580680fd6dfce9f4b09fdb367ee1e3e
Copying blob sha256:fc878cd0a91c7bece56f668b2c79a19d94dd5471dae41fe5a7e14b4ae65251f6
Copying blob sha256:6154df8ff9882934dc5bf265b8b85a3aeadba06387447ffa440f7af7f32b0e1d
Copying blob sha256:fee5db0ff82f7aa5ace63497df4802bbadf8f2779ed3e1858605b791dc449425
Copying blob sha256:c86b5b5c0c52d2949c29006e38632744bb7cc3f88525959f70375692f5a90983
Copying blob sha256:2c0783ae876319800777d0ba051919b455bd5fafe8e3ba10bfcf3d8e8875bd9b
Copying blob sha256:0e5ecbaa74330c55752b884a5946537d2ae4a4da892ce817392948d9606c2daa
Copying blob sha256:e424bb47af136bcc59cde73b78f3afe16b75b9a33b951e09bcfc639f6c5020e9
Copying config sha256:70ee5049912bba1a24eeabc6a5fa59b76646a367eb3551e855587bbfa627ab89
Writing manifest to image destination
Storing signatures
2020/06/04 12:42:05  info unpack layer: sha256:d51af753c3d3a984351448ec0f85ddafc580680fd6dfce9f4b09fdb367ee1e3e
2020/06/04 12:42:06  info unpack layer: sha256:fc878cd0a91c7bece56f668b2c79a19d94dd5471dae41fe5a7e14b4ae65251f6
2020/06/04 12:42:06  info unpack layer: sha256:6154df8ff9882934dc5bf265b8b85a3aeadba06387447ffa440f7af7f32b0e1d
2020/06/04 12:42:06  info unpack layer: sha256:fee5db0ff82f7aa5ace63497df4802bbadf8f2779ed3e1858605b791dc449425
2020/06/04 12:42:06  info unpack layer: sha256:c86b5b5c0c52d2949c29006e38632744bb7cc3f88525959f70375692f5a90983
2020/06/04 12:42:06  info unpack layer: sha256:2c0783ae876319800777d0ba051919b455bd5fafe8e3ba10bfcf3d8e8875bd9b
2020/06/04 12:42:12  info unpack layer: sha256:0e5ecbaa74330c55752b884a5946537d2ae4a4da892ce817392948d9606c2daa
2020/06/04 12:42:18  info unpack layer: sha256:e424bb47af136bcc59cde73b78f3afe16b75b9a33b951e09bcfc639f6c5020e9
ESC[34mINFO:   ESC[0m Creating SIF file...
Starting up rserver...
+ SCRATCH_MOUNT=/path/user
+ export SINGULARITY_CACHEDIR=/path/user/.singularity
+ SINGULARITY_CACHEDIR=/path/user/.singularity
+ SINGULARITYENV_RSTUDIO_PASSWORD=password
+ singularity exec -c -H /home/user/home_rstudio:/home/user -B /tmp/tmp.brV9LOtAZA:/tmp -B /path/user -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/Rprofile.site:/etc/R/Rprofile.site -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/bin/auth:/bin/auth -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/rsession.sh:/rsession.sh -B /home/user/ood/data/sys/dashboard/batch_connect/sys/rstudio-srv/output/ce5b7031-3b8d-4f4d-9bfd-acd38226a724/rsession.log:/rsession.log docker://rocker/rstudio:latest /usr/lib/rstudio-server/bin/rserver --www-port 35714 --auth-none 0 --auth-pam-helper-path /bin/auth --auth-encrypt-password 0 --rsession-path /rsession.sh
Discovered RStudio Server listening on port 35714!
Generating connection YAML file...
User defined signal 2

Since the job started successfully in the back end, could there have been a problem communicating the job’s state to the front end?
Do you have an idea how we can pin down the cause of this issue?

mario · June 5, 2020, 4:39pm

Hey @rotaugenlaubfrosch!

The reason that the session may be stuck in a pending state could be that your login nodes aren’t able to see connection.yml generated on the compute nodes when starting a session.

Start by checking whether file system writes from your compute nodes sync quickly enough to your login nodes.
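
One rough way to measure that delay is to create a marker file from a compute node and time how long a login node takes to see it. The sketch below is not part of Open OnDemand: `wait_for_file` and its parameters are hypothetical names, and the path in the usage comment is a placeholder for a file on the same shared file system as the session's connection.yml.

```ruby
# Hypothetical helper (not part of Open OnDemand): poll until a file becomes
# visible on this host and return the elapsed seconds, or nil on timeout.
def wait_for_file(path, timeout: 60.0, interval: 0.5)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    return elapsed if File.exist?(path)
    return nil if elapsed >= timeout
    sleep(interval)
  end
end

# Usage on a login node, started just before the compute node writes the file:
#   seconds = wait_for_file("/path/to/shared/fs/marker-file")
#   puts seconds ? format("visible after %.1f s", seconds) : "not visible in time"
```

If the measured delay is regularly longer than a few seconds, the login node may be deciding on the session's state before the file has arrived.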

Someone had a similar problem not too long ago:

rotaugenlaubfrosch · June 8, 2020, 11:59am

Hi @mario
Thanks for your reply! The job isn’t stuck in the pending state; instead, the panel disappears completely, as if the job had ended. We checked the shared file system, but it probably isn’t the cause, since synchronization is fast.

mario · June 8, 2020, 2:06pm

@rotaugenlaubfrosch

This may be related to an open issue where jobs fail silently: https://github.com/OSC/ondemand/issues/232

I think Jupyter is failing to listen on a port. Could you share script.sh.erb? I will try and debug this locally

rotaugenlaubfrosch · June 10, 2020, 2:33pm

Hi @mario
Thanks for your help. Usually RStudio works fine - this problem just appears occasionally.
Next time it happens, we’ll investigate again and I will follow up with you.
Best regards

rotaugenlaubfrosch · June 24, 2020, 3:24pm

We were able to reproduce this issue.
We found out that the problem was caused by slow data sync between different filesystems.
Apart from that the following question arose: How does the front end check whether the application has started? Or in other words, how does the front end know when it should replace the “starting” session panel with the “running” or “connect to” panel?

mario · June 24, 2020, 4:46pm

@rotaugenlaubfrosch On the My Interactive Sessions page, JavaScript polls on a timer every 5 seconds to check the status and re-render the session panel.

Here’s where that happens:

https://github.com/OSC/ondemand/blob/master/apps/dashboard/app/assets/javascripts/batch_connect/sessions.coffee#L9


# Place all the behaviors and hooks related to the matching controller here.
# All this logic will automatically be available in application.js.
# You can use CoffeeScript in this file: http://coffeescript.org/

#
# Timer object
#

class Timer
  constructor: (@callback, @delay) ->
    @remaining = @delay
    @active = true
    @resume()
  resume: () ->
    return unless @active
    @start = new Date()
    clearTimeout(@timerId)
    @timerId = setTimeout(@callback, @remaining)
  restart: () ->

https://github.com/OSC/ondemand/blob/40c6ab3e8715b3958eddc3d28703e3f3794bd0c1/apps/dashboard/app/helpers/batch_connect/sessions_helper.rb#L2-L34


def session_panel(session)
  content_tag(:div, id: session.id, class: "panel panel-#{status_context(session)} session-panel", data: { hash: session.to_hash }) do
    concat(
      content_tag(:div, class: "panel-heading") do
        content_tag(:h1, class: "panel-title") do
          concat link_to(content_tag(:strong, session.title), new_batch_connect_session_context_path(token: session.token))
          concat " (#{session.job_id})"
          concat(
            content_tag(:div, class: "pull-right") do
              num_nodes = session.info.allocated_nodes.size
              num_cores = session.info.procs.to_i

              # Generate nice status display
              status = []
              if session.starting? || session.running?
                status << content_tag(:span, pluralize(num_nodes, "node"), class: "badge") unless num_nodes.zero?
                status << content_tag(:span, pluralize(num_cores, "core"), class: "badge") unless num_cores.zero?
              end
              status << "#{status session}"
              status.join(" | ").html_safe

This file has been truncated.

fenz · June 26, 2020, 9:16am

Hi all, I’m working with @rotaugenlaubfrosch.
We are probably getting closer to a solution (or at least to an understanding of the issue).
We have 2 OOD instances in different locations, and they share a path; let’s call it “global”.
The “ood” folder that holds the “data” folder is in global.
In the first instance, let’s call it ood1, we start an interactive app, APPNAME:
/global/ood/data/sys/dashboard/batch_connect/sys/APPNAME/output/SessionID1
The app starts without problems.
Then I connect to the second instance, ood2, in another location. I check “My Interactive Sessions” and see “You have no active sessions.” I reload the page, and the session in the first instance (ood1) is gone. The job still runs, but I no longer have any sessions.
Where is the data for “Sessions” stored? How do the 2 instances interfere when they use the same /global/ood/data/sys/dashboard/batch_connect/sys/APPNAME folder?

fenz · July 3, 2020, 3:20pm

Is there anyone who can help answer my questions?
To recap:

1. Which files does the web app check to determine whether there is an open session?
2. Is it possible to have 2 different OOD instances share the same path for interactive applications without conflicts?

Thanks for your help

jeff.ohrstrom · July 6, 2020, 1:53pm

Hi, sorry for the delay.

By default it will look in this directory ~/ondemand/data/sys/dashboard/batch_connect/db/ for sessions.

Yes, this is possible; we do this at OSC for our test, dev and production instances. They all share our home NFS directories. (Note that I specify home NFS directories. Because we rely on file permissions, every file we write is assumed to be in a home directory or something similar. Your use of “global” may imply a shared location that anyone can write to (like /tmp), and I want to be clear that every user needs their own unique directory to write to.)

However, all of the instances have the same cluster configurations. That is, if I start a job on instance1 to cluster1, then login to instance2, cluster1 needs to exist on that instance with the same configuration.

My guess is, in your case, instance 2 saw the job created by instance 1, queried the cluster (as defined by its cluster configs and the .cluster_id attribute of the file), could not find it and so assumed it was ‘complete’.
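
That failure mode can be modelled in a few lines. This is only an illustrative sketch, not OOD's actual code: `SessionRecord`, `session_state`, and the query lambdas are hypothetical stand-ins for the session file in the db directory, the status check, and a scheduler query such as bjobs. The job ID is made up.

```ruby
# Illustrative model only; these names are hypothetical, not OOD internals.
SessionRecord = Struct.new(:job_id, :cluster_id)

# clusters maps a cluster_id to a callable that returns a job state, or nil
# when that scheduler does not know the job. The sketch assumes the
# cluster_id is configured on the instance doing the lookup.
def session_state(record, clusters)
  query = clusters.fetch(record.cluster_id)
  query.call(record.job_id) || :completed  # an unknown job is treated as finished
end

# Both instances read the same session file, but the same cluster_id points
# each of them at a different scheduler.
instance1 = { "cluster1" => ->(job_id) { { "1001" => :running }[job_id] } }
instance2 = { "cluster1" => ->(_job_id) { nil } }

record = SessionRecord.new("1001", "cluster1")
puts session_state(record, instance1)  # running
puts session_state(record, instance2)  # completed, although the job is alive
```

Under these assumptions the second instance reports the session as completed, which would match the panel disappearing while the job keeps running.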

fenz · July 6, 2020, 6:38pm

Thanks for your answer. It makes sense now. The “cluster_id” is the same, but the different clusters (managed with LSF MultiCluster) do not share the job queue, and when you run “bjobs” on a cluster you only see the jobs running on that cluster. So, as you said, instance2 checks the job (I guess based on the job_id stored in the “db” folder) and, not seeing it, assumes it has completed. In principle you can pass the option “-m cluster_name” to specify the cluster to query, but I’m not sure this is used here. So a couple of questions:

Would it be possible to use the “cluster_id” value with the “-m” option in LSF (bjobs -m cluster_id)?
I assume cluster_id relates to the cluster where the OOD instance that starts the job request is, not the cluster that actually executes the job, right? Would it be possible to customise the info stored as “cluster_id” (again, to be used with the “-m” option)?

jeff.ohrstrom · July 6, 2020, 8:06pm

cluster_id is OOD parlance and comes from the name of the cluster.yml file. It should be the same as the cluster, but doesn’t necessarily have to be.

You can specify the cluster attribute in the configuration, which causes the -m flag to be populated. Apparently that’s not documented, so that’s a miss on our side.

Here’s an LSF configuration example that illustrates both of these.

# cluster_id will be "owens_cluster" not "owens"!  So all references to this
# cluster in any form.yml will have to be "owens_cluster".
#
# /etc/ood/config/clusters.d/owens_cluster.yml
---
v2:
  metadata:
    title: "Owens"
  login:
    host: "owens.osc.edu"
  job:
    adapter: "lsf"
    bindir: "/path/to/lsf/bin"
    libdir: "/path/to/lsf/lib"
    envdir: "/path/to/lsf/conf"
    serverdir: "/path/to/lsf/etc"
    # here is the missing cluster attribute
    cluster: "owens"

fenz · July 13, 2020, 5:13pm

Sorry for the late reply.
I’m not sure I got it completely.
In our case we have 3 sites but we can make the example with just 2:

Cluster1, with an OOD instance on a login node of Cluster1.
Cluster2, with a second OOD instance on a login node of Cluster2.

Now, from Cluster2, to check jobs in Cluster1 you need to run bjobs -m Cluster1.
That’s why, if the cluster name is part of the interactive app info (like in cluster_id), the session can check the correct cluster.
From the configuration you shared, this seems to be local to the OOD instance, so from what I understand the instance on Cluster2 will run “bjobs -m Cluster2” and the instance on Cluster1 will run “bjobs -m Cluster1” (because each looks at its “local” name). Is my understanding correct?
The way it may work for us is:

1. Submit the interactive app job specifying the cluster (bsub -clusters Cluster1).
2. The information gets stored in “cluster_id” (cluster_id=Cluster1).
3. The session can check the job using “bjobs -m cluster_id” (so this should work from any OOD instance).

Do you think that’s something feasible?

jeff.ohrstrom · July 13, 2020, 6:35pm

Your understanding is close. In short: the clusters.d filenames should match the v2.job.cluster attribute. And you should probably have 3 of them, one for each cluster.

Here’s a very short example for cluster_1 (that’s missing all the other stuff above)

# /etc/ood/config/clusters.d/cluster_1.yml
v2:
  job:
    # cluster attribute here is the same as the filename
    cluster: "cluster_1"

It’s very important that the cluster attribute is the same as the filename. In your 3 cases, here’s where they come from.

1. bsub -m cluster_1 comes from the v2.job.cluster attribute. Any bjobs or bsub command will use the -m option with this string if it’s there (it wasn’t before, which is why things are broken for you).
2. cluster_id is an OOD reference, and it comes from the filename. It’s how OOD keeps track of clusters.
3. Combination of the two. OOD will read cluster_id from the file as cluster_1 and will configure itself from /etc/ood/config/clusters.d/cluster_1.yml. This is the relationship between the cluster_id and the filename. In this yml file it’ll read v2.job.cluster as cluster_1, so it will run a bsub command with -m cluster_1 because that cluster configuration is available.
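
To make that relationship concrete, here is a minimal sketch that derives cluster_id from a config filename and builds bjobs arguments from v2.job.cluster. The helper names (`cluster_id_for`, `bjobs_args`) and the job ID are hypothetical, and this is not OOD's implementation; it only mirrors the rule described above, that -m is added when the cluster attribute is set.

```ruby
require "yaml"

# Hypothetical helpers that mirror the rule described above.

# The cluster_id is the config filename without its .yml extension.
def cluster_id_for(config_path)
  File.basename(config_path, ".yml")
end

# Build a bjobs argument list; -m is only added when v2.job.cluster is set.
def bjobs_args(config_yaml, job_id)
  config = YAML.safe_load(config_yaml) || {}
  cluster = config.dig("v2", "job", "cluster")
  args = ["bjobs"]
  args += ["-m", cluster] if cluster
  args << job_id
end

config = <<~YAML
  v2:
    job:
      adapter: "lsf"
      cluster: "cluster_1"
YAML

puts cluster_id_for("/etc/ood/config/clusters.d/cluster_1.yml")  # cluster_1
puts bjobs_args(config, "12345").join(" ")                       # bjobs -m cluster_1 12345
```

Without the cluster attribute the same helper yields plain `bjobs 12345`, which only sees jobs known to the local cluster.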

fenz · July 14, 2020, 2:50pm

A bit better, but maybe there’s still a gap.
LSF can submit a job to another cluster using multicluster with the “-clusters” option:
cluster_1$ bsub -clusters cluster_2
Now the issue is the JOBID: the JOBID that bjobs reports on one cluster is different from the one reported on the other cluster (for bjobs, the multicluster option is “-m”).
I can show you the output in our configuration (rid and rka are 2 different clusters):

RID cluster:

[maffiaa@ridnc001is03 ~]$ bsub -n 1 -clusters rka sleep 60
Job <267698> is submitted to default queue <long>.
[maffiaa@ridnc001is03 ~]$ bjobs 
USER       QUEUE      JOBID      JOBI APPLICATION   SERVICE_CLASS SLOTS STAT  IDLE_FACTOR RUN_TIME        RU_UTIME     RU_STIME     PEND_TIME    FIRST_HOST  NEXEC_HOST MEM        GPU_NUM    GPU_MODE 
maffiaa    long       267698     0          -             -       1     RUN   0.00        00:00:00        00:00:00     00:00:00     00:00:00     rkanc007is0 1          0 Mbytes       -          -

RKA cluster:

[maffiaa@rkanc001is01 ~]$ bjobs 
USER       QUEUE      JOBID      JOBI APPLICATION   SERVICE_CLASS SLOTS STAT  IDLE_FACTOR RUN_TIME        RU_UTIME     RU_STIME     PEND_TIME    FIRST_HOST  NEXEC_HOST MEM        GPU_NUM    GPU_MODE 
maffiaa    long       52943      0          -             -       1     RUN   0.00        00:00:00        00:00:00     00:00:00     00:00:00     rkanc007is0 1          2 Mbytes       -          -    
[maffiaa@rkanc001is01 ~]$ bjobs -m rid
USER       QUEUE      JOBID      JOBI APPLICATION   SERVICE_CLASS SLOTS STAT  IDLE_FACTOR RUN_TIME        RU_UTIME     RU_STIME     PEND_TIME    FIRST_HOST  NEXEC_HOST MEM        GPU_NUM    GPU_MODE 
maffiaa    long       267698     0          -             -       1     RUN        -      00:00:11        00:00:00     00:00:00     00:00:00     rkanc007is0 1              -          -          -

So:

To use the multicluster capability, we need to be able to specify the “-clusters” option for the bsub command (and I guess that’s something we can do in the submit for each app) instead of using the value from the yaml file (/etc/ood/config/clusters.d/cluster_1.yml in your example).
The name we used with the “-clusters” option should be stored in the app (like in the cluster_id field) so that OOD can check whether the session is alive using bjobs -m “cluster_id” and get the same JOBID (otherwise, I guess, with different JOBIDs it is going to consider the job completed).

It seems to me that OOD uses the info from “v2.job.cluster” (local to the cluster where the OOD instance runs) with the “-m” option in bsub and bjobs commands (bsub accepts “-m” to specify a host on a cluster, but not the entire cluster to which to submit or forward the job, so maybe this needs to be “-clusters”). In this way we can’t control where to send the jobs.
By the way, the bigger issue we have is the one mentioned at the beginning of the thread: if we use a “shared” home, then not seeing a job that runs on another cluster, or seeing a different job ID, will cause the session to be cleaned. So even if we don’t want to forward jobs (because latency may be bad anyway), opening 2 different instances of OOD to work on different clusters will be impossible with a shared home.

jeff.ohrstrom · July 14, 2020, 4:16pm

OK, looking at our code, -m is used when querying bjobs and not during the submit, but you can add the -clusters argument in the submit.yml.erb as you’ve indicated. I believe this is the only solution for controlling where jobs are sent.

Can you deploy 1 OOD instance with 3 cluster configurations instead of 3 instances with 1 cluster config? It seems you’re deploying OOD on login nodes; maybe it’s more appropriate to give OOD its own VM, as that’s generally the deployment strategy. This OOD VM should then be able to communicate with all clusters. This is the route I would suggest.

I believe you will only have collisions if you have colliding configs (i.e., /etc/ood/config/clusters.d/rka.yml on more than 1 instance. This will create jobs with cluster_id = rka on every instance it’s on, and so you have collisions). So the configuration file names and their contents need to be unique across deployments if you are deploying multiple separate instances. I believe if the deployment on rid finds a cluster_id = rka it should simply ignore it because there is no cluster for rka on that node (there is no clusters.d/rka.yml on the rid host). This should be the behavior. Can you confirm this is the case, that you’re getting collisions because all instance configurations share the same filenames?

You can change the location of these files by using a different OOD_DATAROOT as described in this topic. Though I would leave this as a last possible resort. The uniqueness of each deployment is going to become untenable over time and likely confusing to your users, so I would suggest exploring 1 deployment with 3 unique configurations especially as 1.8 (the next release) rolls out because in that version apps can support multiple clusters.

fenz · July 14, 2020, 8:14pm

We have an OOD instance (actually 2 with a load balancer) for each of our sites, and we have 1 single URL that redirects to the closest instance (since we have a site in the US and another in Europe, latency can be a factor). That’s why we have them on the login nodes of the specific sites.
The name of the cluster is generic and is not the same as the LSF cluster names, so I guess when OOD runs bjobs to check the running jobs it just gets the “local” ones (so, as you said, the cluster_id is probably simply ignored).
Configuring 3 clusters in each OOD instance is not a big deal, but I’m not sure how this will help. How will OOD decide which cluster was used? I guess there’s a “cluster” field in the “form.yml” of an app; is this used to define which of the cluster configurations to use? Anyway, the value used with the “-clusters” option will not help at all in terms of info for OOD, right?

jeff.ohrstrom · July 14, 2020, 9:22pm

That makes sense that they’re geolocated. You don’t need all three configurations, that only kind of makes sense if you have one instance. They do however just need to be unique in their filename. If you’re using a generic cluster name for all 3 clusters, then that’s the source of your issues.

Yes. cluster: "rka" will use the rka.yml configuration file. cluster in a form references the file in clusters.d, and if the job is successfully submitted it is the value of cluster_id. All of these are the same, and the filename is the source of it all.

It will if it’s the same as v2.job.cluster. We have to get all 3 of these items the same.

Here’s how you’d configure your RKA cluster. It needs to have a unique filename (so it doesn’t collide with other cluster_ids) and the corresponding v2.job.cluster.

This config will ensure that OOD queries bjobs with -m rka and that you create jobs with cluster_id: "rka".

# /etc/ood/config/clusters.d/rka.yml
v2:
  job:
    cluster: "rka"

Then when you add this to your submit.yml OOD will use the -clusters flag to submit the job to the correct cluster.

# An RKA app's submit.yml
script:
  native:
    - "-clusters"
    - "rka"

You need all of these for this to work: unique filenames for each cluster, the v2.job.cluster attribute populated for that cluster, and the -clusters script.native parameters configured in any app’s submit.yml. And all 3 of these things need to be the same string.
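
A small consistency check along these lines could catch a mismatch between the three strings before users hit it. This is a hypothetical sketch, not an OOD tool: `cluster_mismatches` is an invented name, and it assumes the submit.yml content has already been rendered to plain YAML (no ERB).

```ruby
require "yaml"

# Hypothetical check, not an OOD tool: lists mismatches between the cluster
# config filename, its v2.job.cluster value, and the value that follows
# "-clusters" in an app's script.native list.
def cluster_mismatches(config_path, config_yaml, submit_yaml)
  cluster_id = File.basename(config_path, ".yml")
  job_cluster = (YAML.safe_load(config_yaml) || {}).dig("v2", "job", "cluster")
  native = Array((YAML.safe_load(submit_yaml) || {}).dig("script", "native"))
  flag_index = native.index("-clusters")
  submit_cluster = flag_index && native[flag_index + 1]

  problems = []
  unless job_cluster == cluster_id
    problems << "v2.job.cluster is #{job_cluster.inspect}, expected #{cluster_id.inspect}"
  end
  unless submit_cluster == cluster_id
    problems << "-clusters is #{submit_cluster.inspect}, expected #{cluster_id.inspect}"
  end
  problems
end

config = "v2:\n  job:\n    cluster: \"rka\"\n"
submit = "script:\n  native:\n    - \"-clusters\"\n    - \"rka\"\n"

p cluster_mismatches("/etc/ood/config/clusters.d/rka.yml", config, submit)  # []
puts cluster_mismatches("/etc/ood/config/clusters.d/generic.yml", config, submit)
```

The second call reports two problems, because a generic filename matches neither the cluster attribute nor the -clusters value.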

fenz · July 15, 2020, 9:12am

I feel we are getting closer, but I still see 2 issues here:

1. If bjobs -m uses the name from the configuration and not from the “cluster_id” of the app, it will work when I’m on “rka”, but when I connect to “rid” and check the session, bjobs -m will use “rid”, I will get a different job ID, and the session will be killed (at least in the web interface, since the job will still be running). I guess it would make sense to use “cluster_id” with the “-m” option, since cluster_id should tell you which cluster the job was submitted from and thus which cluster to query.
2. Using the “cluster” from form.yml means that I need different apps for each site, and this is something I would like to avoid.

Point 2 is something I guess we will have to deal with, but point 1 is the most important: if I got it right, the session will always disappear from the web interface.

By the way, to clarify our configuration: we have a generic name for the cluster since this allows us to have the same configuration everywhere and, in the end, we consider it just 1 cluster with 3 sites. This is why it made sense for us to have just “generic.yml” and to decide where to run the job using only “bsub -clusters” and “bjobs -m”. For me, different configurations make sense if you have really different clusters with different schedulers or paths (because the 3 yml files for us would be exactly the same except for the “cluster” name). Maybe a “multi-cluster” configuration could be one single yml with just a “list” of cluster names, like:

# /etc/ood/config/clusters.d/generic.yml
v2:
  job:
    cluster:
      - rka
      - rid

But, clearly, I’m not sure how this affects or can be managed in OOD.
Probably this last part should have been discussed in a different topic but I wanted to mention it here to have the full picture.

Thanks a lot for your support

alanc · July 15, 2020, 1:49pm

Regarding point 2, that will be fixed in release 1.8 coming in the next few weeks. Apps will be able to submit to different clusters based upon a user choosing the cluster in the form. Note, you can grab the ‘latest’ repo to try to test this out now if you want.
