Background:
I’ve had issues where my OOD instance basically stops working for some users after a while (typically after a few days).
I really don’t have that many users, maybe 20 or so running sessions when it breaks. It typically only breaks for one or two users; others can continue to log in just fine.
It’s not easy to reproduce, and I’ve struggled to get the right information from my logs.
I’ve suspected for a while that it’s due to some resource exhaustion on the OS, and on some occasions (not all) the errors clearly indicate as much, e.g.
```
ERROR: Cannot fork() a new process: Resource temporarily unavailable
ERROR: boost::thread_resource_error: Resource temporarily unavailable
```
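Those fork failures suggest the kernel refused to create a new task, which on a systemd/cgroups host can just as easily be the pids controller (TasksMax / pids.max) or the per-user nproc limit as actual memory pressure. For what it’s worth, next time it happens I plan to run checks along these lines (a sketch assuming cgroups v2; `ood.service` is a placeholder for my actual unit name):

```
# Kernel-wide ceilings on processes/threads
sysctl kernel.pid_max kernel.threads-max

# What systemd allows the service to spawn
systemctl show ood.service -p TasksMax -p LimitNPROC

# Live task count vs. limit in the service's cgroup (cgroups v2 paths)
cat /sys/fs/cgroup/system.slice/ood.service/pids.current
cat /sys/fs/cgroup/system.slice/ood.service/pids.max

# Per-user process limit as seen from a login shell
ulimit -u
```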
I’m still not sure whether it’s caused by my setup being too constrained or by some process running amok.
My OOD instance runs in podman, launched via a systemd service.
I’ve tried to ensure that things like LimitNOFILE and ulimit aren’t too constrained, but I’m honestly not sure whether I’ve missed something simple, because I don’t actually know what is hitting the limit.
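Concretely, this is roughly how I’ve been comparing the configured limits against what the running processes actually get, since a drop-in or a podman default could silently override the unit file (the unit name and PID below are placeholders):

```
# Limits systemd applied to the unit
systemctl show ood.service -p LimitNOFILE -p LimitNPROC

# Limits the kernel actually enforces on a running process
# (replace 12345 with a real PID of a PUN/nginx worker)
cat /proc/12345/limits

# Open file descriptors held by that process
ls /proc/12345/fd | wc -l

# Thread counts per user, to spot whoever is approaching nproc
ps -eLf --no-headers | awk '{print $1}' | sort | uniq -c | sort -rn | head
```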
Even frantically clicking around in every app, I can’t manage to reproduce it intentionally myself.
Based on some of the logs I suspect the Files app (at least sometimes), and I can break it by navigating to a directory with an enormous number of files, which results in the error
```
Error occurred when attempting to access /pun/sys/dashboard/files/fs//cephyr/NOBACKUP/Datasets/OpenImages-V6/images/train
```
and forces me to kill the hung process on the server node (which was pegged at 100% CPU), otherwise I can’t load the portal at all. That said, this symptom isn’t quite what users are seeing.
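When that happens, I identify the hung per-user NGINX (PUN) process roughly like this before killing it (the PID is just an example; the strace step is optional but shows whether it’s stuck enumerating the directory):

```
# PUN workers show up as "nginx: worker process" owned by the affected user
ps -eo pid,user,pcpu,etime,args --sort=-pcpu | head -n 15

# Optional: confirm it's spinning on directory reads (12345 is an example PID)
strace -p 12345 -f -e trace=getdents64
```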
Question: Any tuning of OS resource limits?
Maybe I’m just missing some tuning, whether for the OS itself, the container, or the systemd service launching it? I’m very open to suggestions!