Debugging 404 in Jupyter on new cluster

Hi all,

We’re setting up OnDemand for a new slurm cluster. We have a working OnDemand instance on two of our other slurm clusters. However, we’re having some issues with the new setup, and I’d like some help debugging a 404 response.

This is the status of our OOD instance:

  • Has working shell access to the login node (through the “Shell Access” dropdown)
  • The Jupyter app can submit jobs that start the jupyter notebook server on the compute node. I’ve verified that the server is running on the compute node by ssh’ing onto the compute node and running jupyter notebook list

However, when I connect to the notebook server using OOD, I get a 404. Here are the details. I’ve double-checked the config files based on some of the other 404-related help posts on the forum, but no luck so far. Any other config files or logs I should check?

Request/response:

POST https://ondemand-dev-ice.pace.gatech.edu/node/compute-ice-dev-slurm-5.pace.gatech.edu/8275/login
Status: 404 Not Found
Version: HTTP/1.1
Transferred: 590 B (196 B size)
Referrer Policy: strict-origin-when-cross-origin
Request Priority: Highest

Request header:

POST /node/compute-ice-dev-slurm-5.pace.gatech.edu/8275/login HTTP/1.1
Host: ondemand-dev-ice.pace.gatech.edu
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/111.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://ondemand-dev-ice.pace.gatech.edu/pun/sys/dashboard/batch_connect/sessions
Content-Type: application/x-www-form-urlencoded
Content-Length: 25
Origin: https://ondemand-dev-ice.pace.gatech.edu
Connection: keep-alive
Cookie: MOD_AUTH_CAS_S=23179e65d04f90ca70eb770fdbb6a559
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Sec-GPC: 1

Response header:

HTTP/1.1 404 Not Found
Date: Thu, 23 Mar 2023 14:00:11 GMT
Server: Apache/2.4.34 (Red Hat) OpenSSL/1.0.2k-fips
Content-Security-Policy: frame-ancestors https://ondemand-dev-ice.pace.gatech.edu;
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Length: 196
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

Is there anything in the logs to correlate with the request? I’d check the logs in:
/var/log/httpd/<hostname>_error.log

I’d also check the session log to see what commands are running:
~/ondemand/data/sys/dashboard/batch_connect/sys/<app>/output/<session id>/output.log

I’d be especially curious whether anything in the output.log or the connection.yml looks wrong.

Finding some errors in the log would be a good first step.

Also, how did you configure the reverse proxy in your ood_portal.yml? Seeing those fields would help confirm that the generated URL isn’t wonky.

Here’s the output.log and connection.yml. The output.log shows that the notebook server is running and listening on the reported port. I don’t see any error logs in /var/log/httpd/.

output.log.txt (7.1 KB)
connection.yml (84 Bytes)

And here’s my ood_portal.yml

ood_portal.yml (2.5 KB)

Try adjusting that host_regex, as it does not capture the pattern for compute-ice-dev-slurm-5.pace.gatech.edu.

I played with it in regex101 and got this to work, but make sure to check that it works for your expected cases:
(login|atl1|compute\d*)[\w.-]*\.pace\.gatech\.edu

Hmm, I double checked both of these in regex101:

(login|atl1|compute)[\w.-]*\.pace\.gatech\.edu
(login|atl1|compute\d*)[\w.-]*\.pace\.gatech\.edu

And they both match compute-ice-dev-slurm-5.pace.gatech.edu and compute-ice-dev-slurm-6.pace.gatech.edu (which are the two nodes on our dev cluster). The \w will match [a-zA-Z0-9_], so I think it makes sense that having (login|atl1|compute)[\w.-]* would match compute-ice-dev-slurm-5.
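For anyone who wants to repeat this check outside regex101, here is a minimal Python sketch. It assumes Python's re module treats these simple patterns the same way the web server's regex engine does. The /node/<host>/<port> wrapping in proxy_match is only an approximation of the generated proxy rule, and the labels "plus", "star" and "digits" are made up for illustration.

```python
import re

# The host_regex variants quoted in this thread, written as plain regexes
# (that is, after YAML unescaping). The dictionary keys are illustrative labels.
CANDIDATES = {
    "plus": r"(login|atl1|compute)[\w.-]+\.pace\.gatech\.edu",
    "star": r"(login|atl1|compute)[\w.-]*\.pace\.gatech\.edu",
    "digits": r"(login|atl1|compute\d*)[\w.-]*\.pace\.gatech\.edu",
}

# The two compute nodes on the dev cluster.
HOSTS = [
    "compute-ice-dev-slurm-5.pace.gatech.edu",
    "compute-ice-dev-slurm-6.pace.gatech.edu",
]


def check_all():
    """Map each variant label to True if it matches every host in full."""
    return {
        name: all(re.fullmatch(rx, host) is not None for host in HOSTS)
        for name, rx in CANDIDATES.items()
    }


def proxy_match(host_regex, path):
    """Approximate the /node/<host>/<port> rule (the wrapping is an assumption).

    Returns (host, port) when the path matches, otherwise None.
    """
    pattern = r"^/node/(?P<host>" + host_regex + r")/(?P<port>\d+)"
    m = re.match(pattern, path)
    return (m.group("host"), int(m.group("port"))) if m else None


if __name__ == "__main__":
    print(check_all())
    print(proxy_match(
        CANDIDATES["digits"],
        "/node/compute-ice-dev-slurm-5.pace.gatech.edu/8275/login",
    ))
```

Under these assumptions every variant matches both hostnames, which would suggest looking beyond the regex itself if the 404 persists.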

As a sanity check, I did put your regex in ood_portal.yml, but I got the same errors as before.

A coworker who was debugging this noticed this in their nginx log:

App 95505 output: [2023-03-23 11:36:46 -0400 ] ERROR "Session specifies nonexistent 'pace-ice' cluster id."

I double checked that the cluster in form.yml.erb (which is cluster: "ice-slurm") matched the intended filename (which is /etc/ood/config/clusters.d/ice-slurm.yml), so I’m not sure where the pace-ice cluster ID is coming from.
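Since the cluster ID is taken from the config filename (ice-slurm.yml gives ice-slurm), one quick check is to list the IDs implied by the files on disk and compare them with the ID in the error. A minimal Python sketch; the helper names are hypothetical, and it assumes the configs use the .yml extension as in this thread.

```python
from pathlib import Path

# Directory mentioned in this thread; adjust if your install differs.
DEFAULT_DIR = "/etc/ood/config/clusters.d"


def cluster_ids(config_dir=DEFAULT_DIR):
    """Return the sorted cluster IDs implied by the *.yml filenames."""
    return sorted(p.stem for p in Path(config_dir).glob("*.yml"))


def is_known_cluster(cluster_id, config_dir=DEFAULT_DIR):
    """True if a <cluster_id>.yml file exists in config_dir."""
    return cluster_id in cluster_ids(config_dir)


if __name__ == "__main__":
    print(cluster_ids())
    print(is_known_cluster("pace-ice"))
```

If pace-ice is not in the list, whatever refers to it is not one of the current cluster config files.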

Which regex are you testing?

In the ood_portal you provided this is the regex I saw and tested:
host_regex: "(login|atl1|compute)[\\w.-]+\\.pace\\.gatech\\.edu"

That didn’t match the hostname I saw for the compute node.

I also want to confirm you issued the update_ood_portal command and restarted ood after the host_regex change.

I’m now using:

host_regex: '(login|atl1|compute\d*)[\w.-]*\.pace\.gatech\.edu'
node_uri: '/node'
rnode_uri: '/rnode'

After making that change, I only ran “Restart Web Server” from the OOD webpage. I hadn’t run update_ood_portal. I can try that and let you know.

OK, yeah, you will need to run that command to load the new config in. It will close connections, so be aware if you have active users.
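One way to confirm that a host_regex change actually reached the web server is to look for it in the generated Apache config rather than in ood_portal.yml. A rough Python sketch follows. The config path in the comment is an assumption (it varies by install), and SAMPLE_CONF only imitates the shape of a LocationMatch line; it is not copied from a real generated file.

```python
import re


def location_match_patterns(conf_text):
    """Return the quoted pattern of every <LocationMatch "..."> opening tag."""
    return re.findall(r'<LocationMatch\s+"([^"]+)"\s*>', conf_text)


def regex_is_deployed(conf_text, host_regex):
    """True if host_regex appears verbatim inside any LocationMatch pattern."""
    return any(host_regex in p for p in location_match_patterns(conf_text))


# Illustrative stand-in for a generated config that still has the old regex.
SAMPLE_CONF = r'''
<LocationMatch "^/node/(?<host>(login|atl1|compute)[\w.-]+\.pace\.gatech\.edu)/(?<port>\d+)">
</LocationMatch>
'''

# The regex that was just written to ood_portal.yml in this thread.
NEW_REGEX = r"(login|atl1|compute\d*)[\w.-]*\.pace\.gatech\.edu"

if __name__ == "__main__":
    # On a real host, read the generated file instead, for example
    # (the path is an assumption and depends on how httpd is installed):
    # conf_text = open("/etc/httpd/conf.d/ood-portal.conf").read()
    print(regex_is_deployed(SAMPLE_CONF, NEW_REGEX))
```

If the new regex is missing from the generated file, the portal config has probably not been regenerated since the edit.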

I am curious now though, what are the names of the cluster configs on the file system?

Running update_ood_portal took care of everything, thanks!

We’re working on a dev instance, prior to rolling them out to prod, so running the command didn’t affect anyone.

Awesome! Glad it worked out, let us know if you have any more questions.
