Good morning!
In the Zoom meeting this morning, Jeffrey Ohrstrom suggested that I post a message on Discourse.
On one of our clusters, all users have a common primary/login group called “students”. Every user is also explicitly listed as a member of this group. When a user whose primary/login group is “students” attempts to log in, things fall over. NOTE: you CAN be a member of this group, but if you make it your default/login gid, login will fail.
Here are the ood package versions:
ondemand-release-web-3.1-1.el9.noarch
ondemand-nodejs-3.1.5-1.el9.x86_64
ondemand-runtime-3.1.5-1.el9.x86_64
ondemand-ruby-3.1.5-1.el9.x86_64
ondemand-passenger-6.0.20-1.ood3.1.5.el9.x86_64
ondemand-apache-3.1.5-1.el9.x86_64
ondemand-nginx-1.24.0-1.p6.0.20.ood3.1.5.el9.x86_64
ondemand-gems-3.1.7-1-3.1.7-1.el9.x86_64
ondemand-3.1.7-1.el9.x86_64
Here’s a quick way to create a group with 40k users:
echo "tester:x:1000:$(for i in {1..40000}; do echo -n "tester$i,"; done)"
The error I see in /var/log/ondemand-nginx/tester/error.log:
[ E 2024-10-08 10:45:15.5589 1023/Tl age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/ood/apps/sys/dashboard: An operating system error occurred while preparing to spawn an application process: Error looking up OS group account 1001: Numerical result out of range (errno=34)
Error ID: c87334ad
Error details saved to: /tmp/passenger-error-21DZlW.html
[ E 2024-10-08 10:45:15.5644 1023/T9 age/Cor/Con/CheckoutSession.cpp:281 ]: [Client 2-4] Cannot checkout session because a spawning error occurred. The identifier of the error is c87334ad. Please see earlier logs for details about the error.
Any thoughts?
Thanks!
Christopher Orr
Note that this was originally noticed on machinery that uses the ldap/sssd mechanism to pull in and cache users/groups.
I was able to duplicate it on a VM without sssd, simply by populating a group in /etc/group.
Thanks for posting on Discourse. I will take a look at this shortly, though I have a few things on my plate; I should be able to report back later this week or early next week.
I can’t replicate this initially in a container. I used the same ‘echo’ but didn’t get any errors.
I figure this is either a kernel issue or an nginx issue. I’m testing on a nightly so maybe that’s why I can’t directly replicate. What version of OnDemand are you using and what’s the kernel version? I’m running on kernel 6.8.0.
Hi Jeff!
I’m running kernel 5.14.0-503, Rocky Linux 9.5, and Open OnDemand 3.1.10.
The machine I’m running on is a basic install of Rocky and OOD on a VM.
Thanks!
-chris
Hi Jeff!
This evening, we were able to determine a few things.
- This seems to be a Ruby error, related to the Phusion web server
- If I have a HUGE group of 40k users in my groups file, but I’m not a member of it and my primary group comes AFTER the huge group, Phusion will generate this error.
- If I move my primary login group’s position to BEFORE the large group, I’m able to login and proceed as normal.
I’m happy to meet with you at some point and do a screen share!
Thanks,
-chris
I’m not sure if a meeting would help. If you search “Error looking up OS group account 1001: Numerical result out of range (errno=34)” you get some hits, though mostly just around C++ programs generating errno=34 for a variety of reasons.
I’m not sure if this is coming from Passenger or Nginx, but it would seem one or the other (or both) need to be modified and recompiled to resolve your issue. Some buffer somewhere isn’t big enough to deal with this situation.
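Just to sanity-check that it really is the group-lookup buffer rather than nginx or the kernel, something like the following standalone sketch (my own throwaway test, not anything from OnDemand or Passenger; it assumes the "tester" group from the earlier echo is present in /etc/group) should reproduce the same errno by calling getgrnam_r with a fixed 128 KiB buffer, which a 40k-member group easily overflows:

    #include <grp.h>        // getgrnam_r, struct group
    #include <cerrno>       // ERANGE
    #include <cstdio>
    #include <cstring>      // strerror
    #include <vector>

    int main(int argc, char **argv) {
        // "tester" is the group created by the earlier echo one-liner.
        const char *name = (argc > 1) ? argv[1] : "tester";

        // Deliberately modest fixed-size buffer; a ~40k-member group won't fit.
        std::vector<char> buf(128 * 1024);

        struct group grp;
        struct group *result = nullptr;
        int err = getgrnam_r(name, &grp, buf.data(), buf.size(), &result);

        if (result != nullptr)
            std::printf("found group %s (gid %u)\n", grp.gr_name, (unsigned) grp.gr_gid);
        else if (err == ERANGE)
            std::printf("ERANGE: %zu-byte buffer too small for group '%s'\n", buf.size(), name);
        else
            std::printf("lookup failed: %s\n", err ? std::strerror(err) : "no such group");

        return 0;
    }

Running that against the big group should print the ERANGE message, i.e. the same errno=34 (“Numerical result out of range”) that shows up in the Passenger log.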
I think the next step here is to find exactly where this is throwing an error. Both programs are open source, so we can see where to find this once we have the exact line that is erroring.
It appears to be in Passenger in:
src/cxx_supportlib/SystemTools/UserDatabase.cpp
Likely here:
OsUserOrGroup::OsUserOrGroup()
    // _SC_GETPW_R_SIZE_MAX is not a maximum:
    // Problems with large *nix groups and getgrgid_r / getgrnam_r?
    : buffer(std::max(1024 * 128, sysconf(_SC_GETPW_R_SIZE_MAX)))
{
    // Do nothing
}
Of course, the website referenced in that comment describes the problem, and I interpret the suggested solution as increasing the buffer size incrementally until we find the top end?
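For what it’s worth, here is a rough sketch of what I understand the usual fix pattern to be (this is my own assumption about the shape of a fix, not Passenger’s actual code, and the helper name lookupGroup is just illustrative): retry getgrgid_r with a progressively larger buffer whenever it reports ERANGE, rather than trusting any single fixed size.

    #include <grp.h>         // getgrgid_r, struct group
    #include <unistd.h>      // sysconf
    #include <sys/types.h>   // gid_t
    #include <cerrno>        // ERANGE
    #include <cstring>       // strerror
    #include <algorithm>     // std::max
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Look up a group by gid, growing the buffer on ERANGE instead of trusting
    // a single fixed size. The returned struct's pointers refer into 'buf', so
    // the caller must keep 'buf' alive for as long as the result is used.
    static struct group lookupGroup(gid_t gid, std::vector<char> &buf) {
        long hint = sysconf(_SC_GETGR_R_SIZE_MAX);            // often just a hint, may be -1
        buf.resize(std::max(128L * 1024L, hint > 0 ? hint : 0L));

        for (;;) {
            struct group grp;
            struct group *result = nullptr;
            int err = getgrgid_r(gid, &grp, buf.data(), buf.size(), &result);

            if (err == 0 && result != nullptr)
                return grp;                                    // success
            if (err == ERANGE) {                               // buffer too small: double and retry
                buf.resize(buf.size() * 2);
                continue;
            }
            if (err == 0)
                throw std::runtime_error("group " + std::to_string(gid) + " not found");
            throw std::runtime_error("getgrgid_r: " + std::string(std::strerror(err)));
        }
    }

Doubling keeps the number of retries small even for very large groups, though real code would probably also want a hard upper bound on the buffer size.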