Problems starting VNC on OOD with LSF manager

Hello All,
First of all, the OOD platform is really impressive - nice work OSC!
I’m sure it’s something simple, but I run into a problem when starting a VNC session with LSF as the resource manager.
When I start it from the OOD interface, all the processes related to VNC get started.
However, the frontend keeps saying “Your session is currently starting… Please be patient as this process can take a few minutes.”, and the “Launch Desktop” button never appears.
I tried it with SLURM and it worked, while on LSF there is this issue.
What am I missing?
Thanks!
Rotaugenlaubfrosch

Hi @rotaugenlaubfrosch, what does your $HOME/ondemand/data/sys/dashboard/batch_connect/sys/$VNC_APP_SUBDIR/output/$UUID/output.log say? Or the vnc.log?

Rotaugenlaubfrosch,

This sounds similar to an issue I had when setting up OOD with LSF. In my case it was a race condition: if LSF didn’t show a submitted job via ‘bjobs’ immediately after submission, the startup procedure would never continue.

My solution was to make a wrapper script for the bjobs command that sleeps for a few seconds before running the actual ‘bjobs’ command, and then point to it in cluster.yml:

job:
  bin_overrides:
    bjobs: "/path/to/bjobs/wrapper"

The delay was enough for the launched job to be found, and the startup worked afterwards.

Hi cjh,
Thank you very much for your answer - this looks promising.
Do you have an example of what the wrapper script looks like?
Rotaugenlaubfrosch

The wrapper I used is pretty straightforward:

#!/bin/bash
# Wrapper to sleep bjobs before running

SLEEP=5

OPERATION=/lsf/9.1/linux2.6-glibc2.3-x86_64/bin/bjobs

# Run
sleep "$SLEEP"
exec "$OPERATION" "$@"
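
As a variation on the wrapper above, here is a minimal sketch that retries the command a few times instead of sleeping for a fixed period. The function name retry_cmd and its arguments are made up for illustration, and it assumes that bjobs exits non-zero while the job is not yet visible, which is worth verifying on your LSF version before relying on it.

```shell
#!/bin/bash
# Hypothetical helper (not part of OOD or LSF): run a command up to
# max_tries times, pausing `delay` seconds between failed attempts.
# Usage: retry_cmd MAX_TRIES DELAY COMMAND [ARGS...]
retry_cmd() {
  local max_tries=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    # Stop as soon as the command succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used.
    if (( attempt >= max_tries )); then
      return 1
    fi
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}
```

In the wrapper, the last two lines could then become something like retry_cmd 5 1 "$OPERATION" "$@". Note that any error text from failed attempts is still printed, so the fixed sleep may remain the simpler choice.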

Thanks cjh,
I created a wrapper script as you posted it, but it doesn’t seem to help.
In the wrapper script, bjobs gets executed correctly and the running job is visible.
However, OOD says that the job is starting, although it is already running on the node.


I noticed that there is an ajax request every ~10 seconds, probably to check the status of the submitted job. The wrapper script also gets executed every 10 seconds.
Do you know how OOD checks if the job turned from pending to running?
Thanks

Please look in your log directory for an indication, and post/share any relevant info from that log.

It would be somewhere like this.
$HOME/ondemand/data/sys/dashboard/batch_connect/sys/$VNC_APP_SUBDIR/output/$UUID/output.log

Here $VNC_APP_SUBDIR is probably something like lsf_poc_desktop, and $UUID is the ce25e64b... that you’ve just shared.

Also just to clarify the situation, when you start this job, you’re saying that it sits in the starting state forever? Or does it sit in that state then eventually delete itself?

Also during this state, what does LSF itself say about the job? I mean from LSF’s perspective, what state is the job in, like running or queued, etc?

@rotaugenlaubfrosch were you able to make any progress? Again, we’d want to see things from either output.log or vnc.log or both.

The job goes from the starting state to the running state when a connection.yml is created in the job directory (the Session ID links to this directory in that screenshot you have). What happens is this: we boot everything up, then scan the log file (vnc.log) for a particular line that indicates that everything went well. We then write connection.yml out, at which point everything is good to go.

echo "Scanning VNC log file for user authentications..."
while read -r line; do
  if [[ ${line} =~ "Full-control authentication enabled for" ]]; then
    change_passwd
    create_yml
  fi
done < <(tail -f --pid=${SCRIPT_PID} "vnc.log") &

Scanning VNC log file for user authentications...
# this is the logline right here 
Generating connection YAML file...
Launching desktop 'xfce'...
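
To narrow down where a stuck session stops, something like the following rough sketch can be run against the job directory. The function name session_status and its messages are hypothetical, and it assumes vnc.log and connection.yml both sit directly in that directory; adjust the paths if your layout differs.

```shell
#!/bin/bash
# Hypothetical helper (not part of OOD): report how far a session got.
# $1 is the session's job directory (the one the Session ID links to).
session_status() {
  local dir=$1
  if [[ -f "$dir/connection.yml" ]]; then
    # The frontend should be able to move past "starting" now.
    echo "connection.yml present"
  elif [[ -f "$dir/vnc.log" ]] &&
       grep -q "Full-control authentication enabled for" "$dir/vnc.log"; then
    # The trigger line was logged, but the scan loop did not write the file.
    echo "auth line logged, connection.yml missing"
  else
    # The VNC server never logged the line the scan loop waits for.
    echo "auth line not logged yet"
  fi
}
```

The second case is the interesting one: the VNC server is up, but the scanning loop did not run as intended, which points at how the job script itself is being interpreted.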

Hope that’s helpful.

Hi All,
By default, LSF runs batch jobs using /bin/sh.
However, the /bin/bash shell must be used for all commands to be interpreted correctly.
When adding the #!/bin/bash line to the wrapper script, an empty line gets inserted at the top of job_script_content.sh, which results in LSF still using /bin/sh.

In a nutshell, the solution was to create a #!/bin/bash header as shown here:

---
batch_connect:
  template: vnc
  header: |
    #!/bin/bash
  script_wrapper: |
    export PATH="/system/apps/TurboVNC/bin:$PATH"
    export WEBSOCKIFY_CMD="/system/apps/websockify/run"
    %s
  ...
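
A quick way to confirm the fix took effect is to check that the very first line of the generated job_script_content.sh is the shebang. This is only a sketch: the function name starts_with_bash_shebang is made up, and the file location depends on your session directory.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the first line of file $1 is
# exactly "#!/bin/bash". A leading empty line makes it fail.
starts_with_bash_shebang() {
  local first_line=""
  # read fails on an empty file; treat that as "no shebang".
  IFS= read -r first_line < "$1" || true
  [[ $first_line == '#!/bin/bash' ]]
}
```

For example, starts_with_bash_shebang job_script_content.sh && echo "ok" should print "ok" once the header is in place.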

I couldn’t find anything regarding that in the OOD documentation - so adding this to the docs may be helpful for others too.
Thanks
