Interactive Desktop with Sun Grid Engine

Hello,
Running version 2.0 of Open OnDemand and trying to enable the interactive Desktop. We are running an HPC cluster with Sun Grid Engine. However, the documentation doesn’t provide instructions on how to set up the submission config file for SGE. Wondering if anyone has been successful in implementing this with SGE?

Thanks…

Hello and thanks for the question!

We do provide support for Grid Engine, which you can find here:
https://osc.github.io/ood-documentation/latest/installation/resource-manager/sge.html

If you have any more questions after you look through the docs though please let us know.

Travis,
Thanks for the link. I do have the basics of OnDemand working with our cluster. I’m looking to enable the interactive desktop described here:

https://osc.github.io/ood-documentation/latest/enable-desktops/add-cluster.html

At the bottom of the page, it states that if you try to launch a desktop it will fail miserably, because you first need to set up the submission parameters. That is where the documentation doesn’t provide an example for SGE.

https://osc.github.io/ood-documentation/latest/enable-desktops/custom-job-submission.html

So I’m wondering if someone else might have worked this out?

I have something like:

[root@XXX bc_desktop]# cat my_cluster.yml 
---
title: "My Cluster Desktop"
cluster: "sge"
attributes:
  desktop: "xfce"
  bc_queue: "name_of_queue_withGPUs"
  bc_num_slots: 1
  bc_account: ""
  bc_email_on_started: 0

[root@XXX bc_desktop]#

and I must admit I commented out the lines

args.concat ['-N', script.job_name] unless script.job_name.nil?

and

args.concat ['-P', script.accounting_id] unless script.accounting_id.nil?

in /opt/ood/ondemand/root/usr/share/gems/2.7/ondemand/2.0.20/gems/ood_core-0.18.1/lib/ood_core/job/adapters/sge/helper.rb because I’m moving to Slurm anyway and I needed a proof of concept fast.

Jose,
Thanks. What do you have for your conf file in /etc/ood/config/apps/bc_desktop/submit/my_submit.yml.erb?

That is the example I’m looking for.

Hi, I don’t have that file :confused:

The only additional customisation I’ve done is in /etc/ood/config/clusters.d/sge.yml:

...
job:
  adapter: "sge"
  bin: "/opt/sge/bin/lx-amd64"
  sge_root: "/opt/sge/"
batch_connect:
  vnc:
    script_wrapper: |
      set +o posix
      . ~/.bashrc
      export PATH="/opt/TurboVNC/bin/:$PATH"
      export WEBSOCKIFY_CMD="/usr/bin/websockify"
      %s
...

Having that cluster configuration is good; in fact it’s the better practice because it’s global to all apps.

The SGE docs talk about having to configure the job name (the first item you had to comment out).
https://osc.github.io/ood-documentation/latest/installation/resource-manager/sge.html#invalid-job-name

I’m not sure why you had to comment out accounting_id.

The language, I guess, is a little wonky: the submission parameters you’re looking for are in fact the batch_connect portion of the sge.yml that you’ve provided.

What’s the actual behaviour and/or error you’re seeing?
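That said, if you do want a per-app submit config, a minimal sketch of /etc/ood/config/apps/bc_desktop/submit/my_submit.yml.erb could look like the following. Note the walltime value here is just an illustration; anything under native is passed through to qsub, so adjust it for your site:

```yaml
---
batch_connect:
  template: "vnc"
script:
  native:
    # extra arguments handed straight to qsub
    - "-l"
    - "h_rt=08:00:00"
```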


Jeff,
Thank you for the link on the invalid job name. However, I’m still getting an error when trying to start an interactive session.

Failed to submit session with the following error:

qsub: ERROR! argument to -N option must not contain /

  • If this job failed to submit because of an invalid job name please ask your administrator to configure OnDemand to set the environment variable OOD_JOB_NAME_ILLEGAL_CHARS.
  • The HPC Desktop session data for this session can be accessed under the staged root directory.

My cluster.d file:

v2:
  metadata:
    title: "CoS HPC"
  login:
    host: "submit.server.name"
  job:
    adapter: "sge"
    cluster: "CoS Cluster"
    bin: "/cm/shared/apps/sge/2011.11p1/bin/linux-x64"
    conf: "/cm/shared/apps/sge/2011.11p1/"
    sge_root: "/cm/shared/apps/sge/2011.11p1"
    libdrmaa_path: "/cm/shared/apps/sge/2011.11p1/lib/linux-x64/libdrmaa.so"
  batch_connect:
    vnc:
      script_wrapper: |
        set +o posix
        . ~/.bashrc
        export PATH="/opt/TurboVNC/bin/:$PATH"
        export WEBSOCKIFY_CMD="/usr/bin/websockify"
        unset XDG_RUNTIME_DIR
        %s
      set_host: "host=$(hostname)"

My bc_desktop file:

title: "HPC Desktop"
cluster: "coshpc"
attributes:
  bc_num_slots: 1
  bc_queue: "test.q"
  desktop: "xfce"

And the addition to the nginx_stage.yml file:

pun_custom_env:
  - OOD_JOB_NAME_ILLEGAL_CHARS: "/"

pun_custom_env should be a YAML map, not an array:

pun_custom_env:
  OOD_JOB_NAME_ILLEGAL_CHARS: "/"

Thanks for that. I got past the -N error and it has submitted my request! However, my submitted job is stuck as pending… There are no other jobs running in test.q. The error is not helpful at all:

(-l h_rt=3600) cannot run in queue “test.q” because of cluster queue

Does it not like the time limit option?

I don’t know, but you can grep for execve or qsub in /var/log/ondemand-nginx/$USER/error.log to see the actual qsub commands we’re issuing.

That way you can find the arguments we’re passing and try them through the CLI to see which argument is wrong.

Here is what’s showing in my error log:

App 50259 output: [2022-04-15 13:30:31 -0700 ]  INFO "execve = [{}, \"/cm/shared/apps/sge/2011.11p1/bin/linux-x64/qsub\", \"-wd\", \"/home/cosine/keistc/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/coshpc/output/6f69a9da-5c36-4488-94cc-5f4f091d51d4\", \"-N\", \"sys-dashboard-sys-bc_desktop-coshpc\", \"-o\", \"/home/cosine/keistc/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/coshpc/output/6f69a9da-5c36-4488-94cc-5f4f091d51d4/output.log\", \"-q\", \"test.q\", \"-l\", \"h_rt=01:00:00\"]"
App 50259 output: [2022-04-15 13:30:31 -0700 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/coshpc/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=237.28 view=0.00 location=https://ondemand.science.oregonstate.edu/pun/sys/dashboard/batch_connect/sessions"
App 50259 output: [2022-04-15 13:30:31 -0700 ]  INFO "execve = [{}, \"/cm/shared/apps/sge/2011.11p1/bin/linux-x64/qstat\", \"-r\", \"-xml\", \"-j\", \"744097\"]"

It doesn’t show a script being executed. Anyway, I tried the following via the CLI, specifying the job_script_content.sh file to run. It again put the job in as pending and it never changed:

qsub -wd /home/cosine/keistc/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/coshpc/output/6f69a9da-5c36-4488-94cc-5f4f091d51d4 -N sys-dashboard-sys-bc_desktop-coshpc -o /home/cosine/keistc/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/coshpc/output/6f69a9da-5c36-4488-94cc-5f4f091d51d4/output.log -q test.q -l h_rt=3600 /home/cosine/keistc/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/coshpc/output/6f69a9da-5c36-4488-94cc-5f4f091d51d4/job_script_content.sh

I then tried running the same command but removed the -l h_rt flag. This time the job did try to run, but it errored out with this in the output.log file:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.

job_script_content.sh, as you’ve indicated, is the entrypoint. I think we pass it on stdin to a lot of schedulers.

At this point it’s an SGE thing, and I’m not super familiar with that system. As you can see, it even has the same behavior through a shell. Meaning, when you remove OOD from the equation, you still have this issue.

It doesn’t seem like you’re able to submit to that queue under any circumstance. I’d figure out how to get this working with just the CLI, then tackle what’s going on with OOD.
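For example, on a standard SGE install you could try something like this (a sketch to run on a submit host; the queue name and job id are taken from your posts above):

```shell
# Show the queue's configured limits. If the queue's h_rt is lower than what
# the job requests, or user_lists/hostlist restrict access, jobs sit pending.
qconf -sq test.q | grep -E 'h_rt|user_lists|hostlist'

# Ask the scheduler why a specific pending job isn't running; the
# "scheduling info" section usually names the exact restriction.
qstat -j 744097
```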