We’re setting up OnDemand for a new slurm cluster. We have a working OnDemand instance on two of our other slurm clusters. However, we’re having some issues with the new setup, and I’d like some help debugging a 404 response.
This is the status of our OOD instance:
- Has working shell access to the login node (through the “Shell Access” dropdown)
- The Jupyter app can submit jobs that start the jupyter notebook server on the compute node. I’ve verified that the server is running on the compute node by ssh’ing onto the compute node and running
jupyter notebook list
However, when I connect to the notebook server using OOD, I get a 404. Here are the details. I’ve double-checked the config files based on some of the other 404-related help posts on the forum, but no luck so far. Any other config files or logs I should check?
Transferred: 590 B (196 B size)
POST /node/compute-ice-dev-slurm-5.pace.gatech.edu/8275/login HTTP/1.1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/111.0
Accept-Encoding: gzip, deflate, br
HTTP/1.1 404 Not Found
Date: Thu, 23 Mar 2023 14:00:11 GMT
Server: Apache/2.4.34 (Red Hat) OpenSSL/1.0.2k-fips
Content-Security-Policy: frame-ancestors https://ondemand-dev-ice.pace.gatech.edu;
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Keep-Alive: timeout=5, max=99
Content-Type: text/html; charset=iso-8859-1
Is there anything in the logs to correlate with the request? I’d check the logs in:
I’d also check the logs in:
To see what commands are running. I'd be especially curious whether anything in output.log or connection.yml jumps out as not looking right.
Finding some errors in the log would be a good first step.
Also, how did you configure the reverse proxy in your ood_portal.yml? Seeing those fields would help confirm that the generated URL isn't wonky.
output.log shows that the notebook server is running and listening on the reported port. I don’t see any error logs in
output.log.txt (7.1 KB)
connection.yml (84 Bytes)
And here’s my
ood_portal.yml (2.5 KB)
Try adjusting that
host_regex as it does not work to capture the pattern for
I played with it in regex101 and got this to work, but make sure to check that it works for your expected cases:
Hmm, I double checked both of these in regex101:
And they both match
compute-ice-dev-slurm-6.pace.gatech.edu (which are the two nodes on our dev cluster). The
\w will match
[a-zA-Z0-9_], so I think it makes sense that having
(login|atl1|compute)[\w.-]* would match
As a sanity check, I did put your regex in
ood_portal.yml, but I got the same errors as before.
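For anyone who wants to repeat this check outside regex101, here is a minimal Python sketch. It assumes the proxy needs the whole hostname in a /node/<host>/<port> URL to match host_regex, which is my reading of how the setting is used rather than something confirmed in this thread. The helper names host_allowed and node_target are made up for illustration.

```python
import re

# Pattern quoted earlier in this post; in ood_portal.yml it would be the
# value of host_regex.
HOST_REGEX = r"(login|atl1|compute)[\w.-]*"


def host_allowed(host, host_regex=HOST_REGEX):
    """Return True if the entire hostname matches the pattern (assumed behavior)."""
    return re.fullmatch(host_regex, host) is not None


def node_target(path):
    """Split '/node/<host>/<port>/...' into (host, port); None if it does not fit."""
    m = re.match(r"^/node/([^/]+)/(\d+)(?:/.*)?$", path)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


if __name__ == "__main__":
    # Path taken from the failing request shown at the top of the thread.
    target = node_target("/node/compute-ice-dev-slurm-5.pace.gatech.edu/8275/login")
    if target is not None:
        host, port = target
        print(host, port, host_allowed(host))
```

If the hostname passes here but the browser still gets a 404, the regex itself is probably not the problem, and the next thing to look at is whether the running Apache config was actually regenerated from ood_portal.yml.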
A coworker who was debugging this noticed this in their nginx log:
App 95505 output: [2023-03-23 11:36:46 -0400 ] ERROR "Session specifies nonexistent 'pace-ice' cluster id."
I double-checked that the cluster in form.yml.erb (which is cluster: "ice-slurm") matches the intended filename (which is /etc/ood/config/clusters.d/ice-slurm.yml), so I'm not sure where the pace-ice cluster ID is coming from.
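One way to see which cluster IDs exist on disk is to list the config filenames. This is a rough sketch that assumes, as in the filename check above, that a cluster's ID is the name of its .yml file in /etc/ood/config/clusters.d without the extension. The function names are hypothetical.

```python
from pathlib import Path

# Directory named in this post; adjust if your install keeps configs elsewhere.
CLUSTERS_DIR = "/etc/ood/config/clusters.d"


def cluster_ids(config_dir=CLUSTERS_DIR):
    """Return the sorted cluster IDs implied by the *.yml filenames (assumed mapping)."""
    return sorted(p.stem for p in Path(config_dir).glob("*.yml"))


def cluster_exists(cluster_id, config_dir=CLUSTERS_DIR):
    """Return True if a <cluster_id>.yml file is present in config_dir."""
    return cluster_id in cluster_ids(config_dir)


if __name__ == "__main__":
    print("cluster IDs on disk:", cluster_ids())
    for wanted in ("ice-slurm", "pace-ice"):
        print(wanted, "found" if cluster_exists(wanted) else "missing")
```

If pace-ice is reported missing, the error is likely coming from something that still refers to the old ID, rather than from form.yml.erb.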
Which regex are you testing?
In the ood_portal you provided, this is the regex I saw and tested:
That didn't match the hostname I saw for the compute node.
I also want to confirm you issued the
update_ood_portal command and restarted ood after the
I’m now using:
After making that change, I only ran “Restart Web Server” from the OOD webpage. I hadn’t run
update_ood_portal. I can try that and let you know.
OK, yes, you will need to run that command to load the new config. It will close connections, so be aware of that if you have active users.
I am curious now though, what are the names of the cluster configs on the file system?
update_ood_portal took care of everything, thanks!
We’re working on a dev instance before rolling changes out to prod, so running the command didn’t affect anyone.
Awesome! Glad it worked out, let us know if you have any more questions.