It’ll show up when the job is in the Running state. You’re using Slurm, so I would use squeue to check the state of the job (or the Active Jobs page in OOD).
Undetermined state in OOD is a reflection of the state in the scheduler (Slurm in this case). Slurm seems to have put these jobs in a weird/bad state that is not Running, Queued, Completed or similar.
Weird, neither squeue on the Slurm head node nor the Active Jobs page in OOD shows anything running, but the desktop EC2 instance spun up. Seems suspicious, because I successfully submitted a job via the Job Composer that says it’s queued.
And to your previous question, rnode_uri and node_uri are set to those values you listed.
That is odd, especially since they have job IDs (10 and 12). Maybe inspect sacctmgr or dig into the slurmd logs to see what happened to jobs 10 and 12.
Ah, it turns out slurmdbd on OOD failed to connect to the RDS MySQL instance. After restarting it and slurmd on the head node, I see info from a test job in sacct, squeue and the logs. Interactive desktop jobs now show up in squeue as “Running” on the Slurm head node, but for some reason OOD still doesn’t see them.
[root@ip-10-0-2-78 ssm-user]# squeue
JOBID  PARTITION  NAME      USER   ST  TIME   NODES  NODELIST(REASON)
13     desktop    sys/dash  Admin  R   51:36  1      desktop-dy-desktop-cr-1
14     desktop    sys/dash  Admin  R   20:47  1      desktop-dy-desktop-cr-1
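If it helps to compare what the head node reports with what OOD shows, here is a minimal sketch that parses default squeue table output like the paste above. The function name `parse_squeue` is made up for illustration, and it assumes job names contain no spaces (the default column layout would be ambiguous otherwise).

```python
def parse_squeue(text):
    """Parse default `squeue` table output into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Locate the header row; anything before it (e.g. a shell prompt) is skipped.
    header_idx = None
    for i, ln in enumerate(lines):
        if ln.split()[0] == "JOBID":
            header_idx = i
            break
    if header_idx is None:
        return []
    header = lines[header_idx].split()
    jobs = []
    for ln in lines[header_idx + 1:]:
        # Split into at most len(header) fields so the last column keeps any spaces.
        fields = ln.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


# Sample copied from the squeue output posted in this thread.
SAMPLE = """\
[root@ip-10-0-2-78 ssm-user]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
13 desktop sys/dash Admin R 51:36 1 desktop-dy-desktop-cr-1
14 desktop sys/dash Admin R 20:47 1 desktop-dy-desktop-cr-1
"""

for job in parse_squeue(SAMPLE):
    print(job["JOBID"], job["ST"], job["NODELIST(REASON)"])
```

Both jobs come back with state R here, which is what you would expect OOD to display once it can run the same query successfully.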
You have a bin_overrides for sbatch - do you need one for squeue as well?
You can check the same ondemand-nginx error logs for the squeue command we’re issuing and try to replicate it (by issuing the same command on the same machine as the same user).
Maybe so. Man, this AWS Workshop is all kinds of broken. We’ve got a meeting with AWS support on Monday, will be giving them plenty of feedback. Really appreciate your help, I’ll work on making an squeue override.
App 879391 output: [2023-10-06 18:23:14 +0000 ] INFO "execve = [{}, \"/bin/squeue\", \"--all\", \"--states=all\", \"--noconvert\", \"-o\", \"\\u001E%a\\u001F%A\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\\u001F%b\", \"-j\", \"16\", \"-M\", \"imaging-poc\"]"
App 879391 output: [2023-10-06 18:23:14 +0000 ] ERROR "squeue: error: No cluster 'imaging-poc' known by database.\nsqueue: error: 'imaging-poc' can't be reached now, or it is an invalid entry for --cluster. Use 'sacctmgr list clusters' to see available clusters."
OK cool - just to reiterate though, we’re passing the -M imaging-poc CLI args because of this setting in your cluster.d file. If you remove this entry in the cluster config file we’ll no longer pass the -M argument to any command we issue.
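To illustrate the point about `-M`, here is a rough sketch of how a command line like the one in the log gets assembled. This is not OOD's actual code (OOD is written in Ruby); `build_squeue_args` is a hypothetical helper, and the long `-o` format string from the log is left out for brevity.

```python
def build_squeue_args(job_id, cluster=None, squeue_bin="/bin/squeue"):
    """Build an squeue argument list; add -M only when a cluster name is configured."""
    args = [squeue_bin, "--all", "--states=all", "--noconvert", "-j", str(job_id)]
    if cluster:
        # Mirrors the behaviour described above: the cluster entry in the
        # config is what causes "-M <name>" to be appended.
        args += ["-M", cluster]
    return args


# With the cluster entry set (as in the failing log line):
print(build_squeue_args(16, cluster="imaging-poc"))
# With the entry removed, no -M argument is passed:
print(build_squeue_args(16))
```

Running the printed command by hand on the OOD host, as the same user, should reproduce the "No cluster 'imaging-poc' known by database" error when `-M` is present.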
OK, this is going to take a bit of debugging. You can use Chrome to open your developer tools in a new tab.
You can check this topic on how to debug.
As a first guess I’d ask what your host_regex is and whether you’re using Basic authentication with Safari (Safari won’t open WebSockets over Basic authentication).
Awesome, I’m on Firefox, but I keep Chrome installed for the developer tools. Will take a look through that topic, thanks.
My host regex is host_regex: 'desktop-dy-.*', and the hostname of the desktop node is desktop-dy-desktop-cr-1 (I believe the -1 gets incremented for new hosts).
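A quick way to sanity-check a pattern like this against a hostname is sketched below. It uses Python's `re` module and a full match, both of which are assumptions: the proxy that actually evaluates host_regex may use a different regex engine or anchoring, so treat this as a rough check only. `host_allowed` is a made-up helper name.

```python
import re

HOST_REGEX = r"desktop-dy-.*"  # value quoted in the post above


def host_allowed(hostname, pattern=HOST_REGEX):
    """Return True if the whole hostname matches the pattern."""
    # fullmatch is an assumption: it requires the entire hostname to match.
    return re.fullmatch(pattern, hostname) is not None


print(host_allowed("desktop-dy-desktop-cr-1"))  # the desktop node from this thread
print(host_allowed("ip-10-0-2-78"))             # the head node hostname from the prompt
```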
Cool, host_regex looks OK. I would also check network connectivity between the OOD instance and that newly created EC2 instance. As you’re in AWS, you may need to supply network routes between the two.
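For the connectivity check, something like this small TCP probe run from the OOD instance would do. It is only a sketch: `can_connect` is a made-up helper, and the port in the usage comment is a placeholder, since the real port is whatever the desktop session reports.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (hypothetical port; substitute the one from your session):
# print(can_connect("desktop-dy-desktop-cr-1", 5901))
```

A False result here would point at security groups, routing, or name resolution rather than at OOD itself.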
That’s not the right request; it should be the very next one. It may be using the wss:// protocol instead of https://.
Basically you get the page (that’s the request you’re looking at), then the JavaScript on the page tries to make a WebSocket connection. You’re failing on the second part: getting a WebSocket connection.
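To make the two-step flow a bit more concrete, here is a sketch of how the second request's URL relates to the first. The host name and the `/rnode/.../websockify` path are hypothetical examples, not values taken from this thread; the point is only that the scheme switches from https to wss while the host stays the same.

```python
from urllib.parse import urlsplit, urlunsplit


def to_websocket_url(page_url, ws_path):
    """Derive the WebSocket URL a page would use on the same host."""
    parts = urlsplit(page_url)
    # https pages use secure WebSockets (wss); plain http pages use ws.
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, ws_path, "", ""))


# Hypothetical page URL and WebSocket path, for illustration only.
print(to_websocket_url(
    "https://ood.example.com/some/page.html",
    "/rnode/desktop-dy-desktop-cr-1/5901/websockify",
))
```

In the developer tools, the request to look for is the one with the wss:// scheme, which usually shows up under a WebSocket ("WS") filter in the network tab.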