Unable to get interactive desktops running

jeff.ohrstrom · October 6, 2023, 3:58pm

It’ll show up when the job is in the Running state. You’re using Slurm, so I would use squeue to see what state the job is in (or the Active Jobs page in OOD).

The Undetermined state in OOD is a reflection of the state in the scheduler (Slurm in this case). Slurm seems to have put these jobs in a weird/bad state that is not Running, Queued, Completed or similar.
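As a rough illustration of how a scheduler state can surface as "undetermined", here is a minimal sketch. The table and the function name `coarse_state` are hypothetical and deliberately incomplete; this is not OOD's code, only the general idea that any state code the portal does not recognise falls through to a catch-all.

```python
# Hypothetical, incomplete mapping from Slurm's compact state codes
# (the ST column of squeue) to coarse portal states. Illustration only.
COARSE_STATES = {
    "PD": "queued",     # PENDING
    "R": "running",     # RUNNING
    "CG": "running",    # COMPLETING
    "S": "suspended",   # SUSPENDED
    "CD": "completed",  # COMPLETED
    "CA": "completed",  # CANCELLED
    "F": "completed",   # FAILED
    "TO": "completed",  # TIMEOUT
}


def coarse_state(slurm_code: str) -> str:
    """Return a coarse state, or 'undetermined' for codes not in the table."""
    return COARSE_STATES.get(slurm_code.strip().upper(), "undetermined")
```

With this table, `coarse_state("R")` gives a running state, while an unexpected or empty code ends up as undetermined.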

uklineale · October 6, 2023, 5:24pm

Weird, neither squeue on the Slurm head node nor the Active Jobs page in OOD show anything running, but the desktop EC2 instance spun up. Seems suspicious, because I successfully submitted a job via the Job Composer that says it’s queued.

And to your previous question, rnode_uri and node_uri are set to those values you listed.

jeff.ohrstrom · October 6, 2023, 5:30pm

That is odd - especially as you see they have job ids - 10 and 12. Maybe inspect sacctmgr or dig into slurmd logs to see what happened to jobs 10 and 12.

uklineale · October 6, 2023, 6:15pm

Ah, turns out slurmdbd on OOD failed to connect to the RDS MySQL instance. After restarting it and slurmd on the head node, I see info from a test job in sacct, squeue and the logs. Interactive desktop jobs now show up in squeue as “Running” on the Slurm head node, but OOD still doesn’t see them for some reason.

[root@ip-10-0-2-78 ssm-user]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                13   desktop sys/dash    Admin  R      51:36      1 desktop-dy-desktop-cr-1
                14   desktop sys/dash    Admin  R      20:47      1 desktop-dy-desktop-cr-1

jeff.ohrstrom · October 6, 2023, 6:19pm

You have a bin_overrides for sbatch - do you need one for squeue as well?

You can check the same ondemand-nginx error logs for the squeue command we’re issuing and try to replicate it (by issuing the same command on the same machine as the same user).

uklineale · October 6, 2023, 6:25pm

Maybe so. Man, this AWS Workshop is all kinds of broken. We’ve got a meeting with AWS support on Monday, will be giving them plenty of feedback. Really appreciate your help, I’ll work on making an squeue override.

App 879391 output: [2023-10-06 18:23:14 +0000 ]  INFO "execve = [{}, \"/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"16\", \"-M\", \"imaging-poc\"]"
App 879391 output: [2023-10-06 18:23:14 +0000 ] ERROR "squeue: error: No cluster 'imaging-poc' known by database.\nsqueue: error: 'imaging-poc' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters."
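The `-o` format string in the logged command places the control character U+001E before each record and U+001F between fields. Assuming the output keeps that layout, it could be split as in the sketch below. The function name, the field names, and the sample text are made up for illustration, and real output may also contain a header record.

```python
from typing import Dict, List

RECORD_SEP = "\x1e"  # U+001E, written before every record in the -o format
UNIT_SEP = "\x1f"    # U+001F, written between fields


def parse_records(output: str, field_names: List[str]) -> List[Dict[str, str]]:
    """Split separator-delimited text into one dict per record."""
    records = []
    # Anything before the first record separator is not a record.
    for chunk in output.split(RECORD_SEP)[1:]:
        values = chunk.rstrip("\n").split(UNIT_SEP)
        records.append(dict(zip(field_names, values)))
    return records


# Made-up sample with three fields per record, not real squeue output.
sample = "\x1e13\x1fdesktop\x1fR\n\x1e14\x1fdesktop\x1fR\n"
rows = parse_records(sample, ["job_id", "partition", "state"])
```

An empty string, which is what a failed squeue call leaves on standard output, yields no records at all.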

jeff.ohrstrom · October 6, 2023, 6:29pm

OK cool - just to reiterate though, we’re passing the -M imaging-poc CLI args because of this setting in your cluster.d file. If you remove this entry in the cluster config file we’ll no longer pass the -M argument to any command we issue.
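The relationship between the cluster setting and the `-M` argument can be sketched like this. It is not OOD's implementation: the function name `squeue_argv` is hypothetical, and the long `-o` format string from the log is left out for brevity. The remaining flags are copied from the logged command.

```python
from typing import List, Optional


def squeue_argv(job_id: str,
                cluster: Optional[str] = None,
                squeue_bin: str = "/bin/squeue") -> List[str]:
    """Build an squeue command line; add -M only when a cluster is configured."""
    argv = [squeue_bin, "--all", "--states=all", "--noconvert", "-j", job_id]
    if cluster:
        argv += ["-M", cluster]
    return argv


with_cluster = squeue_argv("16", cluster="imaging-poc")
without_cluster = squeue_argv("16")
```

With a cluster name the command ends in `-M imaging-poc`, as in the log; without one the flag is absent.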

uklineale · October 6, 2023, 6:47pm

Unfortunately, the scripts from the workshop require the cluster to be set, since it reads the yml file for the host.

Great idea on needing an override for squeue, it worked!!! I was whooping with joy to see green on this page!!
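The thread does not show what the override looks like. One possible shape, purely as an assumption, is a small wrapper that drops the cluster flag and then hands off to the real squeue. The path in `REAL_SQUEUE` and both function names are hypothetical.

```python
import os
from typing import List

REAL_SQUEUE = "/usr/bin/squeue"  # assumption: location of the real binary


def strip_cluster_args(args: List[str]) -> List[str]:
    """Remove -M/--clusters and its value from an argument list."""
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            skip_next = False  # this is the value that followed -M
            continue
        if arg in ("-M", "--clusters"):
            skip_next = True   # drop the flag and the value after it
            continue
        if arg.startswith("--clusters=") or (arg.startswith("-M") and len(arg) > 2):
            continue           # flag and value joined in one argument
        cleaned.append(arg)
    return cleaned


def main(args: List[str]) -> None:
    """Replace this process with the real squeue, minus the cluster flag."""
    os.execv(REAL_SQUEUE, [REAL_SQUEUE] + strip_cluster_args(args))
```

A wrapper script built on this would call `main(sys.argv[1:])` and be referenced from `bin_overrides` for squeue.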

Now when I hit the NoVNC page, it fails to connect to the server.

jeff.ohrstrom · October 6, 2023, 6:56pm

Very cool!

OK - this is going to take a bit of debugging. You can use Chrome to open your developer tools in a new tab.

You can check this topic on how to debug.

As a first guess I’d ask what your host_regex is and if you’re using Basic authentication & Safari (Safari won’t open web sockets over basic authentication).

uklineale · October 6, 2023, 7:06pm

Awesome, I’m on Firefox, but I keep Chrome installed for the developer tools. Will take a look through that topic, thanks.

My host regex is host_regex: 'desktop-dy-.*', and the hostname of the desktop node is desktop-dy-desktop-cr-1 (I believe the -1 gets incremented for new hosts).
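A quick way to sanity-check a pattern like this against a hostname is sketched below. It uses Python's `re` module, which is not the regex engine the web server uses, so treat it as a rough check for simple patterns only. The function name `host_allowed` is hypothetical.

```python
import re


def host_allowed(host_regex: str, hostname: str) -> bool:
    """Return True when the whole hostname matches the pattern."""
    return re.fullmatch(host_regex, hostname) is not None


matches_desktop = host_allowed("desktop-dy-.*", "desktop-dy-desktop-cr-1")
matches_head = host_allowed("desktop-dy-.*", "ip-10-0-2-78")
```

The desktop node name matches the pattern, while the head node name from the earlier shell prompt does not.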

jeff.ohrstrom · October 6, 2023, 7:08pm

Cool, host_regex looks OK. I would also check network connectivity between the OOD instance and that newly created EC2 instance. As you’re in AWS you may need to supply network routes between the two.
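A connectivity check of this kind can be scripted from the OOD instance. The sketch below only tests whether a TCP connection can be opened; the function name `can_connect` is hypothetical, and the host and port to probe are whatever the desktop job reports.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `can_connect("desktop-dy-desktop-cr-1", 5902)` run on the OOD instance would show whether that port is reachable from there.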

uklineale · October 6, 2023, 7:16pm

I’ll check that. What ports does the desktop EC2 need open? Websockify looks to listen on a random port.

It’s not using basic auth. Credentials are sent in the query params. It looks like it’s a 502 Bad Gateway HTTP code.

jeff.ohrstrom · October 6, 2023, 7:19pm

It depends on the DISPLAY you open. Here you’ve opened DISPLAY 2, so the port is 5900 + DISPLAY, i.e. 5902.

So you’d likely have to open the range from 5901 to 5901 + the largest number of concurrent desktops you expect.
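The port arithmetic can be written down as a small sketch. It assumes displays are handed out contiguously starting at :1 on a node, which the thread does not confirm; the function names are hypothetical.

```python
VNC_BASE_PORT = 5900


def vnc_port(display: int) -> int:
    """TCP port for a given X display number (5900 + DISPLAY)."""
    return VNC_BASE_PORT + display


def vnc_port_range(max_desktops: int, first_display: int = 1) -> range:
    """Ports needed for max_desktops displays, assuming contiguous numbering."""
    return range(vnc_port(first_display), vnc_port(first_display + max_desktops))


ports_for_three = list(vnc_port_range(3))
```

DISPLAY 2 maps to port 5902, and three concurrent desktops starting at display 1 need ports 5901 through 5903.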

I would have expected a 404 or 403, but not a 502. I would look into Apache’s logs to see what this is all about.

uklineale · October 6, 2023, 7:32pm

The firewall for the desktop EC2 is open to the portal, and I can send communications with netcat alright.

Do you mean the httpd logs on OOD or on the desktop EC2?

jeff.ohrstrom · October 6, 2023, 7:37pm

httpd is only running on OOD (unless you boot it on your desktops). But yea httpd on OOD is where I’d look.

uklineale · October 6, 2023, 8:12pm

Weird, httpd shows a 200 for the request. Chrome shows no HTTP code, Firefox shows the 502.

unix: - - [06/Oct/2023:20:10:18 +0000] "GET /pun/sys/dashboard/noVNC-1.1.0/package.json HTTP/1.1" 200 2314 "https://iwd.aws-research-7225140000-d3b-sandbox01-dev.aws.cloud.chop.edu/pun/sys/dashboard/noVNC-1.1.0/vnc.html?utf8=%E2%9C%93&autoconnect=true&path=rnode%2Fdesktop-dy-desktop-cr-1%2F50242%2Fwebsockify&resize=remote&password=****&compressionsetting=6&qualitysetting=2&commit=Launch+Desktop%3A+Imaging+Poc" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/118.0" "163.116.145.51, 10.0.0.161"
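The referrer in that log line carries a `path` query parameter that names the compute host and a port (50242 here). A minimal sketch for pulling those out of a noVNC URL follows. The function name is hypothetical and the example URL uses a placeholder domain; only the shape of the `path` value is taken from the log.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlsplit


def websockify_target(vnc_url: str) -> Tuple[str, str, int]:
    """Return (proxy kind, host, port) from the URL's path= query parameter."""
    query = parse_qs(urlsplit(vnc_url).query)
    path = query["path"][0]  # e.g. rnode/<host>/<port>/websockify
    kind, host, port, _rest = path.split("/", 3)
    return kind, host, int(port)


# Placeholder domain; the path value mirrors the one in the log above.
example_url = (
    "https://ood.example.org/pun/sys/dashboard/noVNC-1.1.0/vnc.html"
    "?autoconnect=true"
    "&path=rnode%2Fdesktop-dy-desktop-cr-1%2F50242%2Fwebsockify"
    "&resize=remote"
)
target = websockify_target(example_url)
```

The host and port recovered this way are the ones to test when checking connectivity from the OOD instance.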

Other than that, I didn’t really see any logs for the VNC server.

jeff.ohrstrom · October 6, 2023, 8:20pm

That’s not the right request; it should be the very next one. It could be using the wss:// protocol instead of https://.

Basically you get the page (that’s the request you’re looking at) then the javascript on the page tries to make a websocket connection. You’re failing on the 2nd bit - getting a websocket connection.
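The second step, the WebSocket upgrade, can be probed by hand. The sketch below builds a standard upgrade request and reads back the HTTP status: 101 means the upgrade went through, while something like 502 means a proxy in between failed. The function names are hypothetical, and a real OOD portal will also expect the session's authentication, which this sketch does not send, so an unauthenticated probe may be rejected for that reason alone.

```python
import base64
import os
import socket
import ssl


def build_upgrade_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.1 WebSocket upgrade request."""
    key = base64.b64encode(os.urandom(16)).decode("ascii")
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def handshake_status(response: bytes) -> int:
    """Extract the numeric status code from the first response line."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", "replace")
    return int(status_line.split()[1])


def probe(host: str, path: str, port: int = 443, timeout: float = 5.0) -> int:
    """Send the upgrade request over TLS and return the status code."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_upgrade_request(host, path))
            return handshake_status(tls.recv(4096))
```

Comparing the status from the public hostname with the status from the OOD instance itself can help narrow down which hop is refusing the upgrade.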

uklineale · October 6, 2023, 8:26pm

Hmm, nothing in the nginx logs about wss://, but I see Chrome and Firefox attempt to open that websocket connection.

uklineale · October 6, 2023, 8:30pm

That domain name leads to an AWS ALB, so maybe that’s dropping the websocket connection…

jeff.ohrstrom · October 6, 2023, 9:38pm

Could be. You can check the shell app to see if it’s exhibiting the same behavior (it opens a websocket too).
