Hello, we have experienced some stability issues with Open OnDemand with a varying number of simultaneous users (as low as 10 to as many as 80). We use this cluster to run training sessions and bootcamps, and OOD is a key part of our workflow.
We deploy OOD using NVIDIA DeepOps, which uses the OSC OOD Ansible role under the hood. We don’t expose the cluster login node to the Internet directly; instead, our users connect through an SSH tunnel to http://localhost:9090.
The two main issues we’re experiencing are:
A subset of users who successfully logged in to the cluster were not able to open the OnDemand page (http://localhost:9090/) and launch the labs. (I tested this personally using their credentials.)
The errors reported include
Internal Server Error
502 bad gateway error on the page
failed to map user
403 forbidden error on the page
A subset of users who successfully logged in to the cluster and launched the labs received errors as below:
Service Unavailable
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
Apache/2.4.29 (Ubuntu) Server at localhost Port 9090
Our workaround for both of these is to
sudo killall nginx
But doing this in the middle of a bootcamp kills all the processes and affects all the users, so we are hoping to find the root of the problem so that we don’t have to resort to these measures.
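A less disruptive stopgap (a sketch, assuming each user’s PUN runs an nginx master under that user’s own UID, as is standard for OOD) is to stop only the affected user’s processes rather than everyone’s:

```shell
# Hedged sketch: stop one user's per-user nginx (PUN) instead of all of them.
# Assumes each PUN's nginx master process runs under that user's UID.
reset_pun() {
  local user="$1"
  # -u: only processes owned by this user; -f: match the full command line
  pkill -u "$user" -f 'nginx: master' || true
}
```

For example, `sudo bash -c 'reset_pun someuser'` would bounce only that user’s PUN; the next request to the portal respawns it.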
We are requesting help with this issue. What next steps would you recommend for gathering information or addressing this issue?
The “failed to map user” 403 Forbidden error could be an sssd caching issue. You say that the user has to SSH into the box first, then connect to OOD?
I’d check /var/log/apache2 for the 502 Bad Gateway errors, and /var/log/messages or journalctl for some of the 503s.
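To make that log review quicker, a small helper (a sketch; the log path is an assumption for an Ubuntu Apache layout) can tally which Apache error codes dominate:

```shell
# Sketch: count occurrences of each Apache error code (AHxxxxx) in a log.
# Point it at e.g. /var/log/apache2/error.log (path is an assumption).
summarize_ah_codes() {
  grep -oE 'AH[0-9]{5}' "$1" | sort | uniq -c | sort -rn
}
```

Running `summarize_ah_codes /var/log/apache2/error.log` and seeing one code dominate (e.g. a pile of AH02454 socket failures) narrows things down fast.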
I’d wonder if some of the 2nd error is load on Apache or ulimits. I know apache2 starts with some silly defaults (like MaxKeepAliveRequests 100), so maybe tweak these to increase capacity? Also look into the Apache event MPM and increasing its settings (or the worker MPM if you want simpler config settings).
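As a sketch of that tuning (values are illustrative starting points, not recommendations; on Ubuntu this would go in something like /etc/apache2/mods-available/mpm_event.conf):

```apache
# Illustrative event-MPM settings for a few hundred concurrent connections.
# Tune against real load; these are assumed starting points, not canon.
<IfModule mpm_event_module>
    ServerLimit              8
    StartServers             4
    ThreadsPerChild         64
    MaxRequestWorkers      512
    MaxConnectionsPerChild   0
</IfModule>
MaxKeepAliveRequests 1000
```

Note that MaxRequestWorkers can’t exceed ServerLimit × ThreadsPerChild (8 × 64 = 512 here), so raise them together.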
Also, you can have users bookmark http://localhost:9090/nginx/stop?redir=/pun/sys/dashboard/ so they can restart their own PUNs when required.
[ (111)Connection refused: AH02454: HTTP: attempt to connect to Unix domain socket /var/run/ondemand-nginx/u00u5sy0nohcJdb8W9357/passenger.sock (*) failed
We think these errors are suggesting that we might be hitting some ulimit resource limits. Any suggestions on how to increase these limits specifically for the OOD processes would be greatly appreciated!
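One common approach (a sketch; the filename and values are assumptions) is a drop-in under /etc/security/limits.d that raises nproc and the open-files cap for the users running PUNs, since defaults like 4096 processes and 1024 open files can be exhausted quickly with many simultaneous interactive sessions:

```
# /etc/security/limits.d/90-ondemand.conf (assumed filename)
# Raise per-user caps for OOD users; values are illustrative.
*    soft    nproc     16384
*    hard    nproc     16384
*    soft    nofile    8192
*    hard    nofile    8192
```

One caveat: pam_limits only applies to sessions that pass through PAM, so after changing this, verify that the running PUN processes actually picked up the new values (e.g. by reading /proc/<pid>/limits) rather than trusting a login shell’s ulimit output.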
I found that we set this file so root has unlimited processes, so maybe that’s the reason it works at all: we initialize things as root and then fork into the user.
[~()] 🐯 cat /etc/security/limits.d/20-nproc.conf
# This file is being maintained by Puppet.
# DO NOT EDIT
* soft nproc 4096
root soft nproc unlimited
We run RHEL 7.9, and here are my ulimits as a regular user. I’m having trouble sudoing into root to see its ulimits, but from searching /etc/security/, the nproc override above is the only one I came up with.
[~()] 🐼 ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256899
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
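Rather than inferring root’s limits from a shell’s ulimit output, you can read the effective limits of the actual running processes from /proc. A small helper (a sketch) for the “Max processes” line:

```shell
# Print the soft "Max processes" limit of a given PID from /proc.
# Useful against an apache2 or per-user nginx PID, e.g.:
#   max_procs_of "$(pgrep -o apache2)"
max_procs_of() {
  awk '/Max processes/ {print $3}' "/proc/$1/limits"
}
```

Checking this on an actual PUN nginx process (and on Apache) would confirm whether the 4096 nproc cap is what those processes really run under.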