Interactive desktop with OOD not running on cluster

Is it possible to run an interactive desktop (and by extension interactive apps) if the OOD server is not running directly on a cluster? Is there any special configuration that needs to be set up for such a situation? I’ve been following the instructions presented here:

OOD interactive desktop

I am able to submit the job, but after about 2 seconds it goes to completed and never lets me actually start the desktop. I can’t seem to find any output files (like a Slurm output file) to try to troubleshoot this. I can at least verify that my clusters.d config file is set up and working well enough to submit Slurm jobs and open a terminal session to the cluster.
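
For what it’s worth, assuming Slurm accounting (slurmdbd) is enabled on our cluster, something like this on the submit node at least shows whether the job ran at all and how it exited (the format fields are just a reasonable starting set):

# list the last day's jobs for the current user with their state and exit code
sacct -u "$USER" --starttime now-1day \
      --format=JobID,JobName%30,State,ExitCode,Elapsed,NodeList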

Hi Tony.

Thanks for your post.

Are you seeing the session card that contains the Session ID with a link?

If so, please click on that link. You should find a file called output.log. Please paste the contents of that file in this discourse topic.
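
If the card or its link is hard to find, the same files live under your home directory on the OOD host. The exact path segments depend on the app and cluster names, but for the system-installed desktop app it is usually something like:

ls -lt ~/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/*/output/*/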

Thanks,
-gerald

I already tried looking for that file. When I click on the link, I don’t see an output.log file in any of the sessions I have tried to start.

The compute nodes on the cluster have XFCE4, TurboVNC, and websockify.

The cluster itself is set up such that the nodes are accessible through our submit node via local IP addresses, which leads me back to my original question of whether it is possible to use interactive desktops/apps when OOD is running on a separate server from the cluster.

Just for reference, my clusters.d YAML file is:

---
v2:
  metadata:
    title: "Research Cluster"
    url: "https://dorcfandm.github.io/rcs.github.io/"
  login:
    host: "submit node server"
  job:
    adapter: "slurm"
    submit_host: "submit node server"
    bin: "/usr/bin"
    conf: "/opt/slurm/slurm.conf"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/opt/websockify/run"
        %s

The apps/bc_desktop YAML file is:

---
title: "F&M Research Custer"
cluster: "cluster name"
attributes:
   desktop: "xfce"
   bc_queue: null
   bc_account: null
   bc_num_slots: 1

Hi Tony.

Since we are having our monthly open office hours today, maybe you can join us there.

It starts at 11:15 Eastern.

Topic: Open OnDemand Open Office Hours

Time: This is a recurring meeting. Meet anytime.

Join Zoom Meeting

Meeting ID: 962 9856 8321

Password: 424991

One tap mobile

+16513728299,96298568321#,0#,424991# US (Minnesota)

+13017158592,96298568321#,0#,424991# US (Washington DC)

That’s so strange that the conf configuration is required in your clusters.d file. I’m still wondering why that may be the case.

I tried to replicate this on our systems, and here is the output I saw in /var/log/ondemand-nginx/$USER/error.log.

As you can see I did not specify the SLURM_CONF environment variable. The job launched just fine without it, though I don’t even know where our slurm.conf files are on our login nodes.

execve = [{}, "ssh", "-o", "BatchMode=yes", "-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=yes", "owens.osc.edu", "/usr/bin/sbatch", "-D", "/users/PZS0714/johrstrom/ondemand/src/apps/dashboard/data/batch_connect/owens/sys/bc_desktop/owens/output/9edd10fe-0694-410d-859c-c1d8f23b655a", "-J", "ondemand/dev/dashboard/sys/bc_desktop/owens", "-o", "/users/PZS0714/johrstrom/ondemand/src/apps/dashboard/data/batch_connect/owens/sys/bc_desktop/owens/output/9edd10fe-0694-410d-859c-c1d8f23b655a/output.log", "-A", "PZS0714", "-t", "01:00:00", "--export", "NONE", "--nodes", "1", "--ntasks-per-node", "1", "--parsable"]

Thanks. It’s actually helpful to see some output of what the submission looks like. I can say that my log did not look like that at all. I’m running around a bit, but when I get a second I will post what mine looks like. It starts out the same with the BatchMode=yes and all, but after that it’s different.

OK, so now job submissions and interactive desktops aren’t working, whether or not I have the conf: line in my clusters.d YAML file.

For interactive desktop we now get an error like this:

And if I try to ssh directly to the cluster, it asks for a password. When we were troubleshooting before and trying the echo piped into sbatch, I don’t remember that happening. I did double-check that the cluster’s public key is in the OOD server’s ssh_known_hosts file, and they match.

Also, during the troubleshooting there were a few questions about our Slurm setup, in particular where slurmctld was running. So our setup might be a little odd, but we have a head node where slurmctld is running. That head node has very limited user access. Users have ssh access to what we’re calling our submit node, which doesn’t run slurmctld but does have the Slurm commands (sbatch, etc.) and a copy of our slurm.conf, which in our case is in a shared location that all nodes can access.

Try ssh -vvv to see what keys it is using and whether that matches your expectation.

It would seem like the keys are out of sync. I.e., they got wiped from the OOD server or they changed on the cluster (the former seems more likely).
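
To reproduce roughly what the Slurm adapter is doing, you can copy the ssh options straight out of the execve line in error.log and run it by hand. BatchMode=yes makes ssh fail instead of prompting, which is why a password prompt shows up as a failure under OOD. Substitute your own submit_host; sbatch --version is just a harmless command to prove the connection works non-interactively:

ssh -vvv -o BatchMode=yes \
    -o UserKnownHostsFile=/dev/null \
    -o StrictHostKeyChecking=yes \
    <submit_host> /usr/bin/sbatch --version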

The keys all looked OK, but just to be sure I moved my .ssh folder, created a new key, and copied it to the cluster, and now as a user I can once again submit jobs and such. That still leaves my original problem: interactive desktops still aren’t working.

I was watching the Slurm log on my cluster when I submitted an interactive desktop job, and it does actually get a Slurm job ID, but output.log never shows up on the OOD server to give me any other information.
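
Assuming I can grab that job ID from the Slurm log while the job record is still available, scontrol should show where Slurm thinks stdout is going (the job ID below is a placeholder):

scontrol show job <jobid> | grep -E 'StdOut|WorkDir'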

Output from error.log in ondemand-nginx for that user:

INFO "method=GET path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/rcs-sc/session_contexts/new format=html controller=BatchConnect::SessionContextsController action=new status=200 duration=25.52 view=15.12"
App 16492 output: [2023-02-23 09:49:27 -0500 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/opt/slurm/slurm.conf\"}, \"ssh\", \"-o\", \"BatchMode=yes\", \"-o\", \"UserKnownHostsFile=/dev/null\", \"-o\", \"StrictHostKeyChecking=yes\", \"rcs-scsn.fandm.edu\", \"/usr/bin/sbatch\", \"-D\", \"/home/user/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/rcs-sc/output/a4d5d22d-c981-46c3-9ff8-a95aa952e88d\", \"-J\", \"sys/dashboard/sys/bc_desktop/rcs-sc\", \"-o\", \"/home/user/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/rcs-sc/output/a4d5d22d-c981-46c3-9ff8-a95aa952e88d/output.log\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"-N\", \"1\", \"--parsable\"]"
App 16492 output: [2023-02-23 09:49:28 -0500 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_desktop/rcs-sc/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=302 duration=512.04 view=0.00 location=https://rcs-grid.fandm.edu/pun/sys/dashboard/batch_connect/sessions"
App 16492 output: [2023-02-23 09:49:28 -0500 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/opt/slurm/slurm.conf\"}, \"ssh\", \"-o\", \"BatchMode=yes\", \"-o\", \"UserKnownHostsFile=/dev/null\", \"-o\", \"StrictHostKeyChecking=yes\", \"rcs-scsn.fandm.edu\", \"/usr/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"59586\"]"
App 16492 output: [2023-02-23 09:49:28 -0500 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions format=html controller=BatchConnect::SessionsController action=index status=200 duration=348.24 view=19.88"
App 16492 output: [2023-02-23 09:49:38 -0500 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/opt/slurm/slurm.conf\"}, \"ssh\", \"-o\", \"BatchMode=yes\", \"-o\", \"UserKnownHostsFile=/dev/null\", \"-o\", \"StrictHostKeyChecking=yes\", \"rcs-scsn.fandm.edu\", \"/usr/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"59586\"]"
App 16492 output: [2023-02-23 09:49:39 -0500 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=416.56 view=8.68"

I suspect my problem is the hostname. For example, on one of our nodes the hostname is n01.cluster, and that hostname is not accessible from our OOD server. I added the set_host lines in my clusters config file with no luck. The job still doesn’t start, and there still isn’t an output.log file.
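
For reference, the lines I added follow the standard batch_connect layout in the cluster config (a sketch; the hostname expression is just one commonly suggested form and may need adjusting for our network):

v2:
  batch_connect:
    basic:
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      set_host: "host=$(hostname -A | awk '{print $1}')"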

I did some further checking and found this in the Slurm log on the node my interactive desktop request was submitted to.

[2023-02-23T10:30:14.509] [59590.batch] error: Could not open stdout file /home/user/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/rcs-sc/output/48dba91a-1e7b-434d-b080-e27312a9c121/output.log: No such file or directory
[2023-02-23T10:30:14.509] [59590.batch] error: IO setup failed: No such file or directory
[2023-02-23T10:30:14.509] [59590.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256

Looks like the job dies because it can’t write the output.log file. Should I maybe look at the reverse proxy setup? Is that why it can’t write the output file, because it’s trying to write it to the cluster instead of on the OOD server?
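
One way to confirm where that directory actually exists: it is created in the user’s home on the OOD server before sbatch ever runs, and slurmd on the compute node has to be able to open the very same path. Running the same check on the OOD server and on a compute node should make the mismatch obvious:

# run this on the OOD server and again on a compute node;
# if the second one fails, /home is not shared between the two
ls -ld ~/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop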

Based on my understanding of the way OOD seems to work here (with regard to interactive desktops), we won’t be able to use it without modifying our cluster/OOD setup. In our setup, OOD runs on a server completely separate from our cluster. Not even home directories are shared between the two, which is my fault for not initially understanding the suggested OOD architecture.

We will either:

  1. Share the home directories on the cluster with the OOD server (via NFS or something similar)
  2. Run OOD directly on our cluster

@aweaver1fandm FYI, our general guidance around the OOD host is that you should configure it / treat it exactly like you would any regular login host/node on your system. It is fundamentally doing the exact same thing as a regular login node (e.g. providing a direct interface between your clients and your resources). I think most people run OOD in a VM (although you don’t have to) and NFS mount client home directories.
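
As a rough sketch of option 1 (hostnames, export paths, and mount options are placeholders and will differ per site), NFS-mounting the cluster’s home directories on the OOD host could be as simple as an fstab entry like:

# /etc/fstab on the OOD host -- "storage.cluster" is a hypothetical file server
storage.cluster:/home  /home  nfs  defaults,_netdev  0 0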

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.