LSF multi-cluster configuration

Hi all,

we moved to OOD 1.8 and are testing LSF multi-cluster.
Based on a previous discussion: LSF multi-cluster environment deleting panel - #26 by fenz
We set different cluster configurations and wanted to pass the “-clusters” option when submitting the job to start the app, but this does not seem to work. I’ll start by describing what I think the issue is.
This code is used to get the “cluster” option to pass to LSF: ood_core/batch.rb at master · OSC/ood_core · GitHub
and here is where the “value” is used: ood_core/batch.rb at master · OSC/ood_core · GitHub
It seems “cluster_name” gets passed to the “-m” option for every LSF command.
The problem is that LSF is not “exactly” consistent in how it uses the “same” option across different commands.
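For reference, this is roughly how I read the linked batch.rb code. It is a simplified paraphrase for illustration only, not the actual ood_core source, and the class/method names are my own:

class LsfBatchSketch
  def initialize(cluster: nil)
    @cluster = cluster
  end

  # As far as I can tell, the same cluster arguments get prepended to
  # every LSF command (bsub, bjobs, bstop, bresume, bkill).
  def cluster_args
    @cluster ? ["-m", @cluster.to_s] : []
  end

  def submit_string(_script)
    run("bsub", *cluster_args)          # => bsub -m cluster_name ...
  end

  def get_jobs
    run("bjobs", *cluster_args, "-a")   # => bjobs -m cluster_name -a
  end

  private

  def run(cmd, *args)
    # stand-in for the adapter's real command invocation
    puts [cmd, *args].join(" ")
  end
end

LsfBatchSketch.new(cluster: "rid").submit_string(nil)   # prints: bsub -m rid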

In the “bsub” command the “-m” option is used as:

bsub -m "host_name@cluster_name[ | +[pref_level]] | host_group@cluster_name[ | +[pref_level | compute_unit@cluster_name[ | +[pref_level]] ..."
BSUB -m option: IBM ref

So you can have “@cluster_name”, but you specify a “host” in a cluster, not an entire cluster.
The correct option for bsub to specify a cluster would be:

bsub -clusters "all [~cluster_name] ... | cluster_name[+[pref_level]] ... [others[+[pref_level]]]"
BSUB -clusters option: IBM ref

That’s what happens with the bsub command.

In the “bjobs” command, instead, the “-m” option would be the right one for the cluster:
bjobs -m host_name ... | -m host_group ... | -m cluster_name ...
BJOBS -m option: IBM ref
since it takes either a host_name, a host_group or a cluster_name.

And now our problem.
If we don’t specify the “-clusters” option in our “submit.yml”, we get an error like “bad host specification” (since the adapter runs something like “bsub -m cluster_name”). If we do specify the “clusters” option, we get an error like “can’t use -m option with -clusters” (since it runs something like “bsub -m cluster_name -clusters cluster_name”).

For the “multi-cluster” case specifically, I guess it would be better to use “-clusters cluster_name” for the “bsub” command and “-m cluster_name” for the “bjobs” command, but I’m not sure whether this would break anything else.
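Continuing the sketch from above, the change I have in mind would look roughly like this (untested, and again the names are just my paraphrase of the adapter, not its real API):

class LsfBatchSketch
  def submit_cluster_args                  # used only for bsub
    @cluster ? ["-clusters", @cluster.to_s] : []
  end

  def query_cluster_args                   # used for bjobs (and maybe bstop/bresume/bkill)
    @cluster ? ["-m", @cluster.to_s] : []
  end

  def submit_string(_script)
    run("bsub", *submit_cluster_args)        # => bsub -clusters cluster_name ...
  end

  def get_jobs
    run("bjobs", *query_cluster_args, "-a")  # => bjobs -m cluster_name -a
  end
end

LsfBatchSketch.new(cluster: "rid").submit_string(nil)   # prints: bsub -clusters rid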
Any thoughts?

Yes, currently for the LSF adapter, if cluster: is specified in the config, the adapter will add -m cluster_name as an argument to all the commands (bsub, bjobs, bstop, bresume, bkill). This was built against LSF 8 and 9, so I’m not sure if something changed or if the use cases we had to support just didn’t expose this issue.

So are you saying you basically want to change the way bsub is invoked to use different arguments (instead of -m cluster_name, you want -clusters cluster_name)? Or do you want -m cluster_name for bjobs and -m @cluster_name for bsub?

Thanks for the reply.
I can’t tell how it worked in previous versions of LSF. In version 10 there seems to be an inconsistency in LSF’s behaviour.
I guess using only the cluster information would be the right approach:
bsub -clusters cluster_name
and
bjobs -m cluster_name
I’m not sure specifying the host would be useful, at least not by default and not if we want to keep the current intended behaviour.
Is there any chance you could “reproduce” our issue? Even with only one LSF cluster, specifying the “-m” option should generate a “bad host specification” error.

# rid is the local cluster
# sc1 is the remote cluster

$ bsub -m rid wait 30
Bad host specification: local cluster name cannot be specified. Job not submitted.

$ bsub -m sc1 wait 30
Cannot specify the remote cluster name with the -m option in Multi-Cluster or Single-Cluster mode. Job not submitted

$ bsub -clusters rid wait 30
Job <790087> is submitted to default queue <long>.

$ bsub -clusters sc1 wait 30
Job <790094> is submitted to default queue <long>.

$ bjobs -m rid
USER       QUEUE      JOBID      JOBI APPLICATION   SERVICE_CLASS SLOTS STAT  IDLE_FACTOR RUN_TIME        RU_UTIME     RU_STIME     PEND_TIME    FIRST_HOST  NEXEC_HOST MEM        GPU_NUM    GPU_MODE 
maffiaa    long       790087     0          -             -         -   PEND       -      00:00:00             -            -       00:00:00          -          -          -          -          -    

$ bjobs -m sc1 
USER       QUEUE      JOBID      JOBI APPLICATION   SERVICE_CLASS SLOTS STAT  IDLE_FACTOR RUN_TIME        RU_UTIME     RU_STIME     PEND_TIME    FIRST_HOST  NEXEC_HOST MEM        GPU_NUM    GPU_MODE 
maffiaa    long       532056     0          -             -       1     RUN   0.00        00:00:00        00:00:00     00:00:00     00:00:00     sc1nc008is0 1          0 Mbytes       -          -    

This is the behaviour on our cluster. Note that when you submit a job, the JOBID is local to the cluster you submit from, while the bjobs command returns the JOBID from the “remote” cluster. But the most important part is that “bsub -m” does not seem to work.
Let me know if you want me to test anything else to help with this.

Hey - sorry to not follow up on this for so long. I’ve opened this bug on our side. That said - we don’t have a lot of LSF support beyond our initial efforts. The folks who developed it have moved on, and they were the ones with access to other centers. So, pull requests are welcome! I don’t have our IBM contact off hand, but you should engage them too if you’re paying them for support.

Just to follow up on what Jeff said, our contact on the IBM LSF development team is Joanna Wong (yjw@us.ibm.com). I’d recommend touching base with her as well. The last conversation we had with IBM about OOD was last July, I believe.