DCV interactive app setup

Hi,
I am setting up DCV as an interactive app in OOD. Here are my scripts:

form.yml

---
attributes:
  cluster: "hpc-cluster-new"
  desktop: "dcv"
  cpu_cores:
    widget: select
    help: "CPU Cores for dcv session"
    options:
      - [ "vCPUs=1", "1" ]
      - [ "vCPUs=2", "2" ]
      - [ "vCPUs=4", "4" ]
      - [ "vCPUs=6", "6" ]
      - [ "vCPUs=8", "8" ]
    label: "CPU Cores"
  memory:
    widget: select
    help: "RAM"
    options:
      - [ "Memory=4GB", "4" ]
      - [ "Memory=8GB", "8" ]
      - [ "Memory=16GB", "16" ]
      - [ "Memory=32GB", "32" ]
    label: "Memory"
  gpu:
    widget: select
    help: "GPU"
    options:
      - [ "GPU=1", "1" ]
      - [ "GPU=2", "2" ]
      - [ "GPU=3", "3" ]
      - [ "GPU=4", "4" ]
    label: "GPU"
  session_timeout:
    widget: select
    options:
      - [ "5 minutes", "5m" ]
      - [ "1 hour",    "1h" ]
      - [ "2 hours",   "2h" ]
      - [ "4 hours",   "4h" ]
      - [ "1 day",     "1d" ]
      - [ "4 days",    "4d" ]
    label: "Session timeout"
form:
  - desktop
  - cpu_cores
  - memory
  - gpu
  - session_timeout
 

submit.yml.erb

---
cluster: "hpc-dev-cluster"
batch_connect:
  templates: "dcv"
script:
  job_name: "dcv"
  queue_name: "dcv"
  native:
    - "--exclusive"
    - "--cpus-per-task=<%= cpu_cores %>"
    - "--mem=<%= memory %>G"
    - "--gres=gpu:<%= gpu %>"
    - "--export"
    - "DCV_SESSION_TIMEOUT=<%= session_timeout %>"

I want the job to sleep for the specified duration. It was working earlier, but it suddenly stopped: the job goes into the completed state within a few seconds, and there is no output file I can examine for errors.

My before script is the default one, the cleanup script just removes a file, and the after script creates the session and everything else; I verified that part works fine. I think the problem is my

script.sh.erb (intended to sleep for the required time):

#!/bin/bash

# Change working directory to user's home directory
cd "${HOME}"

# Ensure that the user's configured login shell is used
export SHELL="$(getent passwd $USER | cut -d: -f7)"

declare -p >> dcv.log

# Start up desktop
echo "Launching desktop '<%= context.desktop %>'..." >> dcv.log
source "<%= session.staged_root.join("desktops", "#{context.desktop}.sh") %>" >> dcv.log
echo "Desktop '<%= context.desktop %>' ended..." >> dcv.log

if [ -n "${DCV_SESSION_TIMEOUT}" ]; then
    echo "Sleeping for session timeout of ${DCV_SESSION_TIMEOUT}, close in case of kills"

    # Convert session timeout to seconds (assumes format like "1 hour" or "60 minutes")
    # TIMEOUT_SECONDS=$(date -d "${DCV_SESSION_TIMEOUT}" +%s 2>/dev/null)

    if [ $? -eq 0 ]; then
        sleep ${DCV_SESSION_TIMEOUT} || {
            echo "Sleep interrupted, closing session." >> dcv.log
            exit 1
        }
    else
        echo "Invalid session timeout format: ${DCV_SESSION_TIMEOUT}" >> dcv.log
        exit 1
    fi
fi
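As far as I can tell GNU sleep accepts the suffixes I'm passing (s, m, h, d), so a stripped-down version of just the blocking part, which I'd expect to hold the job for the whole timeout, would look something like this (simplified sketch, not my full script):

#!/bin/bash
# Minimal sketch of the blocking guard only; assumes GNU coreutils sleep,
# which accepts suffixes like "5m", "1h", "4d"
timeout="${DCV_SESSION_TIMEOUT:-1h}"

# Only sleep if the value looks like <number><optional s|m|h|d>
if [[ "${timeout}" =~ ^[0-9]+[smhd]?$ ]]; then
    echo "Blocking for ${timeout}" >> dcv.log
    sleep "${timeout}"
else
    echo "Invalid session timeout format: ${timeout}" >> dcv.log
    exit 1
fi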


I want the job to keep running until the specified duration, but it's not. It was working fine earlier…

Hi @jeff.ohrstrom,
could you help me with this please?

Why are you redirecting to dcv.log? Doesn’t your scheduler capture stdout and stderr in output.log?
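If the job gets far enough to stage the session, that file should be sitting in the session’s output directory, something like this (the usual batch_connect layout; adjust the path if your dataroot differs):

# Most recent batch_connect sessions and their scheduler-captured output
ls -lt ~/ondemand/data/sys/dashboard/batch_connect/sys/*/output/*/output.log | head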

Also, I’m not sure why you’re putting so much logic in the script - can’t you just rely on the scheduler’s ability to delete the job after a specified time?
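For example, the form’s timeout strings could be translated into a Slurm walltime and passed as --time instead of sleeping in the script. A rough bash sketch of the mapping (hypothetical helper; the same mapping could just as well live in submit.yml.erb as ERB):

# Hypothetical helper: map the form's timeout options onto Slurm --time values
to_slurm_time() {
  case "$1" in
    5m) echo "00:05:00" ;;
    1h) echo "01:00:00" ;;
    2h) echo "02:00:00" ;;
    4h) echo "04:00:00" ;;
    1d) echo "1-00:00:00" ;;
    4d) echo "4-00:00:00" ;;
    *)  echo "01:00:00" ;;  # fallback
  esac
}

# e.g. sbatch --time=$(to_slurm_time "1h") ... would end the job after an hour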

Lastly I’d wonder about a script blocking vs going into the background. If you launch processes in the background, the script will exit directly after it issues that command. The scheduler in turn believes the job is complete because the script has exited. So it’s quite important that whatever commands you issue run in the foreground and block the script to ensure the scheduler doesn’t stop the job.
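A toy illustration, with sleep standing in for whatever your after.sh launches:

#!/bin/bash
# Toy example of background vs. foreground in a batch script
sleep 300 &   # backgrounded: on its own the script would hit the end
              # immediately and the scheduler would mark the job COMPLETED

wait          # blocking here keeps the batch script, and therefore the job,
              # in RUNNING state until the background process finishes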

I want it to remain in the running state, but instead it goes to completed. That’s why I am using sleep for the requested time.

For a while it shows as starting, then completed.

This is my cluster configuration. I am using Okta OIDC for authentication in my OOD portal. Maybe that’s the issue.

---
v2:
  metadata:
    title: "HPC Cluster"
    url: "https://localhost"
  login:
    host: "localhost"
    user: "%{user}"
    default: true
    auth: "ssh"
  job:
    adapter: "slurm"
    cluster: "<name>"
    bin: "/usr/bin"
    strict_host_checking: false
    ssh:
      UsePAM: true
    auth: "password"
    forward_ssh_key: false
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s

My job is failing:

sacct -j 14450 --format=JobID,JobName,Partition,State,ExitCode,Start,End

JobID           JobName  Partition      State ExitCode               Start                 End 
------------ ---------- ---------- ---------- -------- ------------------- ------------------- 
14450               dcv        dcv     FAILED      1:0 2025-06-07T05:13:41 2025-06-07T05:14:02 
14450.batch       batch                FAILED      1:0 2025-06-07T05:13:41 2025-06-07T05:14:02 
14450.extern     extern             COMPLETED      0:0 2025-06-07T05:13:41 2025-06-07T05:14:02

I referred to this

Another error in the portal:

Exception: OodApp::SetupScriptFailed

Per user setup failed for script at /var/www/ood/apps/sys/myjobs/./bin/setup-production for user ADM_rsinghal4 with output: /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor/command.rb:2:in `<class:Thor>': superclass mismatch for class Command (TypeError)
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor/command.rb:1:in `<top (required)>'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor/base.rb:1:in `require_relative'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor/base.rb:1:in `<top (required)>'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor.rb:1:in `require_relative'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendor/thor/lib/thor.rb:1:in `<top (required)>'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendored_thor.rb:8:in `require_relative'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/vendored_thor.rb:8:in `<top (required)>'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/friendly_errors.rb:3:in `require_relative'
  from /usr/local/share/gems/gems/bundler-2.5.5/lib/bundler/friendly_errors.rb:3:in `<top (required)>'
  from /usr/share/ruby/bundled_gems.rb:75:in `require'
  from /usr/share/ruby/bundled_gems.rb:75:in `block (2 levels) in replace_require'
  from /usr/local/share/gems/gems/bundler-2.5.5/exe/bundle:18:in `<top (required)>'
  from /var/www/ood/apps/sys/myjobs/bin/bundle:3:in `load'
  from /var/www/ood/apps/sys/myjobs/bin/bundle:3:in `<main>'
 

I’m not sure about the very last error; it seems like some library issue, but I’ve never seen that before.

As to the job failing, I’m still not seeing any output.log, so I can’t tell where it’s failing. Clearly from slurmdb it’s exiting 1 somewhere, but without the output.log I can’t say where. Maybe using set -x somewhere will help?
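For example, something like this near the top of script.sh.erb would trace every command and force the output somewhere you can find it, even if output.log never shows up (the log path is just an example):

# Send everything this script does to a known file and trace each command
exec >> "${HOME}/dcv-debug.log" 2>&1
set -x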