We’re setting up OnDemand for a new slurm cluster. We have a working OnDemand instance on two of our other slurm clusters. However, we’re having some issues with the new setup, and I’d like some help debugging a 404 response.
This is the status of our OOD instance:
- It has working shell access to the login node (through the “Shell Access” dropdown).
- The Jupyter app can submit jobs that start the Jupyter notebook server on the compute node. I’ve verified that the server is running by ssh’ing onto the compute node and running jupyter notebook list.
However, when I connect to the notebook server using OOD, I get a 404. Here are the details. I’ve double-checked the config files based on some of the other 404-related help posts on the forum, but no luck so far. Any other config files or logs I should check?
Request/response:
POST https://ondemand-dev-ice.pace.gatech.edu/node/compute-ice-dev-slurm-5.pace.gatech.edu/8275/login
Status: 404 Not Found
Version: HTTP/1.1
Transferred: 590 B (196 B size)
Referrer Policy: strict-origin-when-cross-origin
Request Priority: Highest
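For reference when comparing against connection.yml, here is a minimal Python sketch that splits the failing request URL into the host and port carried in the /node/ path. The helper name split_node_url is hypothetical, and the /node/<host>/<port>/... layout is inferred from the URL above rather than taken from the portal config.

```python
from urllib.parse import urlsplit

# The failing request URL from the report above.
URL = (
    "https://ondemand-dev-ice.pace.gatech.edu"
    "/node/compute-ice-dev-slurm-5.pace.gatech.edu/8275/login"
)


def split_node_url(url):
    """Return (host, port, remainder) from a /node/<host>/<port>/... URL.

    Hypothetical helper: the path layout is assumed from the request shown
    in this thread.
    """
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "node":
        raise ValueError(f"not a /node/<host>/<port> URL: {url}")
    _prefix, host, port, *rest = parts
    return host, int(port), "/" + "/".join(rest)


host, port, remainder = split_node_url(URL)
print(host, port, remainder)
```

The host and port printed here are the values to compare with the host and port recorded in connection.yml and with what jupyter notebook list reports on the compute node.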
Is there anything in the logs to correlate with the request? I’d check the logs in: /var/log/httpd/<hostname>_error.log
I’d also check ~/ondemand/data/sys/dashboard/batch_connect/sys/<app>/output/<session id>/output.log to see what commands are running. I’d be especially curious whether anything in the output.log or the connection.yml looks wrong.
Finding some errors in the log would be a good first step.
Here’s the output.log and connection.yml. The output.log shows that the notebook server is running and listening on the reported port. I don’t see any error logs in /var/log/httpd/.
Try adjusting that host_regex, since it does not capture the pattern for compute-ice-dev-slurm-5.pace.gatech.edu.
I played with it in regex101 and got this to work, but make sure to check that it works for your expected cases: (login|atl1|compute\d*)[\w.-]*\.pace\.gatech\.edu
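A quick way to check the suggested pattern outside regex101 is a short Python sketch like the one below. It makes two assumptions: Python's re module behaves the same as Apache's regex engine for this particular pattern, and re.fullmatch is a fair stand-in for however the generated portal config anchors the host. The helper name host_allowed and the example.com hostname are hypothetical.

```python
import re

# Suggested host_regex from the reply above.
HOST_REGEX = r"(login|atl1|compute\d*)[\w.-]*\.pace\.gatech\.edu"
pattern = re.compile(HOST_REGEX)


def host_allowed(hostname):
    """Return True if the whole hostname matches HOST_REGEX.

    fullmatch is an assumption standing in for the anchoring that the
    portal applies around host_regex.
    """
    return pattern.fullmatch(hostname) is not None


# Compute nodes named in this thread; both should be accepted.
expected = [
    "compute-ice-dev-slurm-5.pace.gatech.edu",
    "compute-ice-dev-slurm-6.pace.gatech.edu",
]
# Hypothetical hostname outside the domain; it should be rejected.
rejected = ["compute-ice-dev-slurm-5.example.com"]

for name in expected + rejected:
    print(f"{name}: {'match' if host_allowed(name) else 'no match'}")
```

If the pattern accepts every expected node here but the portal still returns a 404, the regex itself is probably not the cause.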
And they both match compute-ice-dev-slurm-5.pace.gatech.edu and compute-ice-dev-slurm-6.pace.gatech.edu (which are the two nodes on our dev cluster). The \w will match [a-zA-Z0-9_], so I think it makes sense that having (login|atl1|compute)[\w.-]* would match compute-ice-dev-slurm-5.
As a sanity check, I did put your regex in ood_portal.yml, but I got the same errors as before.
I double-checked that the cluster in form.yml.erb (which is cluster: "ice-slurm") matches the intended filename (which is /etc/ood/config/clusters.d/ice-slurm.yml), so I’m not sure where the pace-ice cluster ID is coming from.