Remote desktop fail (and weird partial fix!)

So I had OOD v3.0.1, installed on Rocky Linux 8 via the Ansible role.

After building a new image which ran dnf update before the OOD install, the remote desktop started failing with:

/var/spool/slurm/job00002/slurm_script: line 3: module: command not found
Setting VNC password...
Error: no HOME environment variable

The relevant bits (I think!) of the config are shown below. From those it’s clear that, for some reason, the Lmod module function is no longer available inside the VNC script (which has worked fine for maybe a couple of years).

I could “fix” this by adding

        - <%= "--export=ALL" %>

to the “native” config for the desktop script, but a) in the remote desktop, opening a terminal just made it close again immediately [no idea how to debug that!], and b) given that exporting the user’s entire environment is Slurm’s default, adding it should do nothing…
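
For reference, that meant the native list in the submit block (shown in full further down) looked roughly like this:

  submit: |
    ---
    script:
      job_name: "ood-desktop"
      native:
        - <%= "--nodes=1" %>
        - <%= "--ntasks=#{num_cores}" %>
        - <%= "--nodelist=#{node}" %>
        - <%= "--export=ALL" %>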

Same behavior on v3.0.3.

I also tried adding #!/usr/bin/env bash to the top of the vnc.script_wrapper, but that didn’t help.
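
That attempt was just one extra line at the top of the wrapper, i.e. roughly:

vnc:
  script_wrapper: |
    #!/usr/bin/env bash
    module purge
    # ... rest of the wrapper as in the config below ...
    %s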

To be honest, the config is a bit cobbled together from things I found, but I have looked at the current desktop example for v3 and it looks similar.

Clearly something is going wrong with the shell/environment variables, but I am at a loss. Any suggestions appreciated.

Config:

openondemand_clusters:
  slurm:
    v2:
      ...
      batch_connect:
        basic:
          script_wrapper: |
            module purge
            export PATH=/opt/jupyter-py39/bin/:$PATH
            %s
          set_host: host=$(hostname -s)
        vnc:
          script_wrapper: |
            module purge

            export PATH=/opt/TurboVNC/bin:$PATH

            # Workaround to avoid "Unable to contact settings server" when
            # launching xfce4-session
            xfce4-session() { /bin/dbus-launch /bin/xfce4-session "$@" ; }
            export -f xfce4-session
            %s
          set_host: host=$(hostname -s)

and

openondemand_apps_desktop_default:
  title: Remote Desktop
  description: Request a desktop to run GUI applications.
  cluster: slurm
  form:
    - desktop
    - bc_queue
    - bc_num_hours
    - num_cores
    - node
  attributes:
    desktop: xfce
    # bc_account: # i.e. slurm account
    #   value: root
    bc_queue:
      value: "{{ openondemand_desktop_partition | default(none) }}"
    num_cores:
      label: Number of cores
      value: 1
    node:
      label: Node name
      help: Select a particular node or leave empty to let Slurm pick the next available
      value: ""
  submit: |
    ---
    script:
      job_name: "ood-desktop"
      native:
        - <%= "--nodes=1" %>
        - <%= "--ntasks=#{num_cores}" %>
        - <%= "--nodelist=#{node}" %>

This thread from OOD v1.6 suggests that having to add --export=ALL or an equivalent may be expected (although I’ve never had to).

Trying to debug why a terminal in the xfce desktop closes, I ran xfce4-terminal -e 'sleep 10', but no output or errors were visible.

Some other threads (1), (2) make it look like having to add

script:
  copy_environment: true

is expected, at least in older versions. This fixes the remote desktop start (a sketch of where the option goes is below the linked threads); I’m still working on the terminal failure, but I think it’s unrelated. Should this be documented or made automatic? Maybe it is and I’ve missed it, or messed up the default config?

  1. Interactive jobs started inside Interactive Desktop don't export environment variables (Slurm)
  2. Ondemand with slurm based sytems, sbatch?
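
For anyone following along, here’s a minimal sketch of where that goes in my setup: the desktop app’s submit block from above, with just the one option added:

  submit: |
    ---
    script:
      job_name: "ood-desktop"
      copy_environment: true
      native:
        - <%= "--nodes=1" %>
        - <%= "--ntasks=#{num_cores}" %>
        - <%= "--nodelist=#{node}" %>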

Sorry for the delay. I’ll read into this a little bit more, but I can’t tell offhand what exactly the issue is at this time.

I just updated the Slurm documentation earlier this week for the same thing. This was developed before my time, so I can’t say why they chose --export=NONE, but they did. I suspect it’s because of the PATH environment variable, which is very specialized in the PUN when you submit jobs. That is, the PATH on the web node at the time of job submission is very specific to the webserver and is not reliable outside of it.
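
To make that concrete, here’s a minimal demonstration with plain sbatch (nothing OOD-specific; DEMO_VAR is a made-up variable and the --wrap job just prints it):

# Set a variable only in the submitting shell:
export DEMO_VAR=hello
# With --export=NONE the job should not inherit it:
sbatch --export=NONE --wrap 'echo "DEMO_VAR=${DEMO_VAR:-<unset>}"'
# With --export=ALL (the usual sbatch default) it should come through:
sbatch --export=ALL --wrap 'echo "DEMO_VAR=${DEMO_VAR:-<unset>}"'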

Thanks for the reply - I’m guessing the updated docs will land at 4. Custom Job Submission — Open OnDemand 3.0.3 documentation when the docs get released? Using copy_environment seems to have fixed it, although I’m unsure why I haven’t needed it before.

The terminal problem turned out to be totally unrelated; it just seemed very plausible that it was connected :scream:.

https://osc.github.io/ood-documentation/latest/installation/resource-manager/slurm.html

I added them here in the Slurm docs, but I’m guessing they could go there too!

Hmm, it turns out using copy_environment isn’t a total fix. In a terminal in the remote desktop, MODULEPATH isn’t set, so there are no modules. I’ll keep digging and try to get a working system up for comparison.
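
For reference, the check in a terminal inside the desktop was roughly:

echo "${MODULEPATH:-<unset>}"   # prints <unset>
type module                     # reports not found: the Lmod shell function is missing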

In case it helps anyone else, here’s the answer. The two problems were related, but the other way around from what I expected, and it wasn’t OOD-related: the user I was testing with had their shell set to /sbin/nologin on the compute nodes :man_facepalming:. Using copy_environment presumably papered over that and allowed the desktop job to start, but the resulting environment was all messed up, so there were no modules.

After fixing the user’s shell definition and removing copy_environment, it all works.
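
For anyone hitting the same thing, the check and fix were along these lines (a sketch assuming a local passwd entry and a made-up username; adjust for LDAP/FreeIPA etc.):

# On a compute node - the last field of the passwd entry is the login shell:
getent passwd demouser
# demouser:x:1001:1001::/home/demouser:/sbin/nologin   <- the problem
# Give the user a real login shell:
sudo usermod -s /bin/bash demouser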