OOD launching desktop on head node, not compute node

Hello,

I’m trying to get an interactive desktop running, but OOD appears to launch the session in the wrong queue. I’m using OOD 3.0.3 with a Slurm cluster created via ParallelCluster.

It spins up a node from the desktop queue, but for some reason the session runs on the head node (identified by clicking “Open in Terminal” from the session and checking the IP).

Where is this controlled?

Look at your nginx logs for the execve command we submit. You can see the exact command we use to submit the job.
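
For example, something along these lines should pull out those entries (this is the default per-user nginx log location - adjust if your install differs, and substitute the username that launched the session):

# Default per-user nginx (PUN) error log; the path may differ on your install.
grep execve /var/log/ondemand-nginx/<username>/error.log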

Given that command-line string, you can then ask yourself: does it need to specify the queue, or should a Slurm submit filter (or similar) put it into the correct queue?

It seems like either you’re not specifying the queue when you should be, or Slurm itself is erroneously placing you in the wrong queue.

Slurm is controlling where the job is actually run. Again, the execve output will tell you the exact sbatch command we issue. At that point it’s out of our hands and in the scheduler’s.
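
Once you have the job id from the session card, you can also ask Slurm directly which partition and node the job actually landed on, e.g.:

# Substitute the Slurm job id shown for the session.
scontrol show job <jobid> | grep -E 'Partition|NodeList|BatchHost'
# Or, while the job is still running:
squeue -j <jobid> -o '%i %P %N %T'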

Gotcha. It looks like the command is right and Slurm sees the right queue, puts it on the right node, etc. I fixed my cluster’s submit.yml.erb and desktop config to use the new files for 3.0.3.

I also realized my previous test of which node it’s running on was invalid. I have a working cluster (the one you helped me with previously, thank you again) that otherwise works but exhibits the same behavior.

Currently, this is the error I’m seeing on the node:

/var/spool/slurmd/job00002/slurm_script: line 136: vncpasswd: command not found
Starting VNC server...
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 145: vncserver: command not found
/var/spool/slurmd/job00002/slurm_script: line 153: vncserver: command not found
Cleaning up...
/var/spool/slurmd/job00002/slurm_script: line 23: vncserver: command not found

I’ve SSH’d into the node to confirm TurboVNC is installed and that vncserver and crew are in PATH. It seems like job_script_content.sh is where these errors are coming from, and that it’s not picking up the PATH variable set in /etc/bashrc. Where does job_script_content.sh get generated from, and how does it get run?
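
For what it’s worth, SSH gives me a login shell, which isn’t necessarily the same environment the batch script runs in, so a non-interactive check along these lines should show whether the job environment is the problem (the partition name is just a placeholder for whatever the desktop queue is called):

# "desktop" is a placeholder for the actual partition name.
sbatch -p desktop --wrap 'echo "$PATH"; command -v vncserver || echo vncserver not on PATH'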

Ah, actually I just realized I’m missing some config! I saw this in our old OOD v2 instance, but I’m not sure where to put the script wrappers in v3.0.3.

# /etc/ood/config/apps/bc_desktop/imaging-poc.yml
v2:
  metadata:
    title: Imaging Desktop
  job:
    adapter: "slurm"
    bin_overrides:
      sbatch: "/etc/ood/config/bin_overrides.py"
      squeue: "/etc/ood/config/squeue_override.py"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        export PATH="/usr/local/turbovnc/bin:$PATH"
        export WEBSOCKIFY_CMD="/usr/local/websockify/run"
        %s

Same spot - that same clusters.d file will continue to work in 3.0.
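
If you want to sanity check that the wrapper is actually being picked up, launch a fresh session and grep the staged script for it. The staged copy lives under your home directory; the path below assumes the default data root, so adjust it if yours differs:

# Find staged batch_connect job scripts that picked up the TurboVNC path.
find ~/ondemand/data/sys/dashboard/batch_connect -name job_script_content.sh -exec grep -l turbovnc {} +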

Excellent, adding that to the appropriate clusters.d file did the trick! Thanks Jeff!

