Unable to get interactive desktops running

It’ll show up when the job is in the Running state. You’re using Slurm, so I would use squeue to check the state of the job (or the Active Jobs page in OOD).

Undetermined state in OOD is a reflection of the state in the scheduler (Slurm in this case). Slurm seems to have put these jobs in a weird/bad state that is not Running, Queued, Completed or similar.

Weird, neither squeue on the Slurm head node nor the Active Jobs page in OOD shows anything running, but the desktop EC2 instance spun up. Seems suspicious, because I successfully submitted a job via the Job Composer that says it’s queued.

And to your previous question, rnode_uri and node_uri are set to those values you listed.

That is odd, especially as you can see they have job IDs: 10 and 12. Maybe inspect sacct or dig into the slurmd logs to see what happened to jobs 10 and 12.

Ah, turns out slurmdbd on OOD failed to connect to the RDS MySQL instance. After restarting it and slurmd on the head node, I see info from a test job in sacct, squeue and the logs. Interactive desktop jobs now show up in squeue as “Running” on the Slurm head node, but for some reason OOD still doesn’t see them.

[root@ip-10-0-2-78 ssm-user]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                13   desktop sys/dash    Admin  R      51:36      1 desktop-dy-desktop-cr-1
                14   desktop sys/dash    Admin  R      20:47      1 desktop-dy-desktop-cr-1
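
To compare what Slurm reports with what OOD displays, the squeue table above can be parsed in a few lines. This is only a minimal sketch that assumes the default squeue column layout shown here; parse_squeue is a hypothetical helper, not part of Slurm or OOD.

```python
def parse_squeue(text):
    """Parse default `squeue` table output into a list of dicts keyed by column header."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    jobs = []
    for ln in lines[1:]:
        # The last column, NODELIST(REASON), may contain spaces, so cap the split.
        fields = ln.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


# Sample copied from the squeue output above.
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                13   desktop sys/dash    Admin  R      51:36      1 desktop-dy-desktop-cr-1
                14   desktop sys/dash    Admin  R      20:47      1 desktop-dy-desktop-cr-1
"""

jobs = parse_squeue(SAMPLE)
running = [job["JOBID"] for job in jobs if job["ST"] == "R"]
print(running)
```

If the job IDs listed here never appear in OOD, the mismatch is on the OOD side (how it queries the scheduler) rather than in Slurm itself.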

You have a bin_overrides for sbatch - do you need one for squeue as well?

You can check the same ondemand-nginx error logs for the squeue command we’re issuing and try to replicate it (by issuing the same command on the same machine as the same user).

Maybe so. Man, this AWS Workshop is all kinds of broken. We’ve got a meeting with AWS support on Monday, will be giving them plenty of feedback. Really appreciate your help, I’ll work on making an squeue override.

App 879391 output: [2023-10-06 18:23:14 +0000 ]  INFO "execve = [{}, \"/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"16\", \"-M\", \"imaging-poc\"]"
App 879391 output: [2023-10-06 18:23:14 +0000 ] ERROR "squeue: error: No cluster 'imaging-poc' known by database.\nsqueue: error: 'imaging-poc' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters."

OK cool. Just to reiterate though, we’re passing the -M imaging-poc CLI args because of this setting in your clusters.d file. If you remove this entry from the cluster config file, we’ll no longer pass the -M argument to any command we issue.
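
The effect of that setting on the command line can be illustrated with a small sketch. This is not OOD’s actual implementation; build_squeue_args is a hypothetical function that mirrors the argument order seen in the log above, with the long -o format string left out for brevity.

```python
def build_squeue_args(job_id, cluster=None, squeue_bin="/bin/squeue"):
    """Assemble an squeue argv; -M is only appended when a cluster name is configured."""
    args = [squeue_bin, "--all", "--states=all", "--noconvert", "-j", str(job_id)]
    if cluster:
        # Corresponds to the cluster entry in the cluster config file.
        args += ["-M", cluster]
    return args


with_cluster = build_squeue_args(16, cluster="imaging-poc")
without_cluster = build_squeue_args(16)
print(with_cluster)
print(without_cluster)
```

With the cluster entry present, squeue has to resolve the name through the accounting database, which is where the “No cluster 'imaging-poc' known by database” error comes from.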

Unfortunately, the scripts from the workshop require the cluster to be set, since they read the yml file for the host.

Great idea on needing an override for squeue, it worked!!! I was whooping with joy to see green on this page!!

Now when I hit the NoVNC page, it fails to connect to the server.

Very cool!

OK, this is going to take a bit of debugging. You can use Chrome to open your developer tools in a new tab.

You can check this topic on how to debug.

As a first guess I’d ask what your host_regex is and if you’re using Basic authentication & Safari (Safari won’t open web sockets over basic authentication).

Awesome, I’m on Firefox, but I keep Chrome installed for the developer tools. Will take a look through that topic, thanks.

My host regex is host_regex: 'desktop-dy-.*', and the hostname of the desktop node is desktop-dy-desktop-cr-1 (I believe the -1 gets incremented for new hosts).
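
A quick way to sanity-check that pattern against the node names is sketched below. It assumes the regex has to match the whole hostname; how OOD actually anchors host_regex in its proxy configuration is not shown in this thread, and host_allowed is a hypothetical helper.

```python
import re

HOST_REGEX = "desktop-dy-.*"  # value from the host_regex setting quoted above


def host_allowed(hostname, pattern=HOST_REGEX):
    """Return True when the whole hostname matches the pattern."""
    return re.fullmatch(pattern, hostname) is not None


results = {
    name: host_allowed(name)
    for name in ("desktop-dy-desktop-cr-1", "desktop-dy-desktop-cr-2", "login-node-1")
}
print(results)
```

The last hostname, login-node-1, is made up purely to show a non-matching case.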

Cool, host_regex looks OK. I would also check network connectivity between the OOD instance and that newly created EC2 instance. As you’re in AWS, you may need to supply network routes between the two.

I’ll check that. What ports does the desktop EC2 instance need open? Websockify looks to listen on a random port.

It’s not using basic auth. Credentials are sent in the query params. It looks like it’s a 502 Bad Gateway HTTP code.

It depends on the DISPLAY you open. Here you’ve opened DISPLAY 2 so the port is 5900 + DISPLAY.

So you’d likely have to open the range from 5901-(5901+the largest number of concurrent desktops you expect.)
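
The arithmetic above, written out as a small sketch. The function names are hypothetical, and the range follows the rule of thumb given here (5901 up to 5901 plus the number of concurrent desktops); it does not cover whatever port websockify itself picks.

```python
VNC_BASE_PORT = 5900


def vnc_port(display):
    """Port for a given DISPLAY number: 5900 + DISPLAY."""
    return VNC_BASE_PORT + display


def port_range(max_desktops):
    """Inclusive port range to open: 5901 through 5901 + max_desktops."""
    first = vnc_port(1)
    return first, first + max_desktops


display_2_port = vnc_port(2)
ten_desktops = port_range(10)
print(display_2_port)
print(ten_desktops)
```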

I would have expected a 404 or 403, but not a 502. I would look into Apache’s logs to see what this is all about.

The firewall for the desktop EC2 instance is open to the portal, and I can reach it with netcat just fine.

Are those the httpd logs on OOD or on the desktop EC2 instance?

httpd is only running on OOD (unless you boot it on your desktops). But yeah, httpd on OOD is where I’d look.

Weird, httpd shows a 200 for the request. Chrome shows no HTTP code, Firefox shows the 502.

unix: - - [06/Oct/2023:20:10:18 +0000] "GET /pun/sys/dashboard/noVNC-1.1.0/package.json HTTP/1.1" 200 2314 "https://iwd.aws-research-7225140000-d3b-sandbox01-dev.aws.cloud.chop.edu/pun/sys/dashboard/noVNC-1.1.0/vnc.html?utf8=%E2%9C%93&autoconnect=true&path=rnode%2Fdesktop-dy-desktop-cr-1%2F50242%2Fwebsockify&resize=remote&password=****&compressionsetting=6&qualitysetting=2&commit=Launch+Desktop%3A+Imaging+Poc" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/118.0" "163.116.145.51, 10.0.0.161"

Other than that, I didn’t really see any logs for the VNC server.

That’s not the right request; it should be the very next one. It could be using the wss:// protocol instead of https://.

Basically you get the page (that’s the request you’re looking at) then the javascript on the page tries to make a websocket connection. You’re failing on the 2nd bit - getting a websocket connection.
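
To know which request to look for in the logs, the websocket URL can be derived from the vnc.html URL. The sketch below approximates what the page’s javascript does with the path query parameter; it is not noVNC’s actual code, websocket_url is a hypothetical helper, and ood.example.org is a placeholder host.

```python
from urllib.parse import parse_qs, urlsplit


def websocket_url(page_url):
    """Build the websocket URL the page would try, from its `path` query parameter."""
    parts = urlsplit(page_url)
    # parse_qs percent-decodes values, so %2F becomes "/".
    path = parse_qs(parts.query)["path"][0]
    scheme = "wss" if parts.scheme == "https" else "ws"
    return f"{scheme}://{parts.netloc}/{path}"


page = (
    "https://ood.example.org/pun/sys/dashboard/noVNC-1.1.0/vnc.html"
    "?autoconnect=true&path=rnode%2Fdesktop-dy-desktop-cr-1%2F50242%2Fwebsockify"
    "&resize=remote"
)
ws_url = websocket_url(page)
print(ws_url)
```

The request to search for in the httpd logs is therefore the one under /rnode/desktop-dy-desktop-cr-1/50242/websockify, not the GET for package.json or vnc.html.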

Hmm, nothing in the nginx logs about wss://, but I see Chrome and Firefox attempt to open that websocket connection.

That domain name leads to an AWS ALB, so maybe that’s dropping the websocket connection…

Could be. You can check the shell app to see if it’s exhibiting the same behavior (it opens a websocket too).