Good morning!
In the Zoom meeting this morning, Jeffrey Ohrstrom suggested that I post a message on Discourse.
On one of our clusters, all users have a common primary/login group called “students”. Every user is also explicitly listed as a member of this group. When a user whose primary/login group is “students” attempts to log in, things fall over. NOTE: you CAN be a member of this group, but if you make it your default/login gid, login will fail.
Here are the ood package versions:
ondemand-release-web-3.1-1.el9.noarch
ondemand-nodejs-3.1.5-1.el9.x86_64
ondemand-runtime-3.1.5-1.el9.x86_64
ondemand-ruby-3.1.5-1.el9.x86_64
ondemand-passenger-6.0.20-1.ood3.1.5.el9.x86_64
ondemand-apache-3.1.5-1.el9.x86_64
ondemand-nginx-1.24.0-1.p6.0.20.ood3.1.5.el9.x86_64
ondemand-gems-3.1.7-1-3.1.7-1.el9.x86_64
ondemand-3.1.7-1.el9.x86_64
Here’s a quick way to create a group with 40k users:
echo "tester:x:1000:$(for i in {1..40000}; do echo -n "tester$i,"; done)"
The error I see in /var/log/ondemand-nginx/tester/error.log:
[ E 2024-10-08 10:45:15.5589 1023/Tl age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/ood/apps/sys/dashboard: An operating system error occurred while preparing to spawn an application process: Error looking up OS group account 1001: Numerical result out of range (errno=34)
Error ID: c87334ad
Error details saved to: /tmp/passenger-error-21DZlW.html
[ E 2024-10-08 10:45:15.5644 1023/T9 age/Cor/Con/CheckoutSession.cpp:281 ]: [Client 2-4] Cannot checkout session because a spawning error occurred. The identifier of the error is c87334ad. Please see earlier logs for details about the error.
Any thoughts?
Thanks!
Christopher Orr
Note that this was originally noticed on machinery that uses the ldap/sssd mechanism to pull in and cache users/groups.
I was able to duplicate it on a VM without sssd, simply by populating a group in /etc/group.
Thanks for posting on Discourse. I will take a look at this shortly, though I have a few things on my plate; I should be able to report back later this week or early next week.
I can’t replicate this initially in a container. I used the same ‘echo’ but didn’t get any errors.
I figure this is either a kernel issue or an nginx issue. I’m testing on a nightly so maybe that’s why I can’t directly replicate. What version of OnDemand are you using and what’s the kernel version? I’m running on kernel 6.8.0.
Hi Jeff!
I’m running kernel 5.14.0-503, Rocky Linux 9.5, and Open OnDemand 3.1.10.
The machine I’m running on is a basic install of Rocky and OOD on a VM.
Thanks!
-chris
Hi Jeff!
This evening, we were able to determine a few things.
- This seems to be a Ruby error, related to the Phusion web server
- If I have a HUGE group of 40k users in my groups file, but I’m not a member of it and my primary group comes AFTER the huge group, Phusion will generate this error.
- If I move my primary login group’s position to BEFORE the large group, I’m able to login and proceed as normal.
I’m happy to meet with you at some point and do a screen share!
Thanks,
-chris
I’m not sure if a meeting would help. If you search “Error looking up OS group account 1001: Numerical result out of range (errno=34)” you get some hits, though mostly just around C++ programs generating errno=34 for a variety of reasons.
I’m not sure if this is coming from Passenger or Nginx, but it would seem one or the other (or both) need to be modified and recompiled to resolve your issue. Some buffer somewhere isn’t big enough to deal with this situation.
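Just to sanity-check that it really is the group-lookup buffer rather than nginx or the kernel, something like the following standalone sketch (my own throwaway test, not anything from OnDemand or Passenger; it assumes the "tester" group from the earlier echo is present in /etc/group) should reproduce the same errno by calling getgrnam_r with a fixed 128 KiB buffer, which a 40k-member group easily overflows:

    #include <grp.h>        // getgrnam_r, struct group
    #include <cerrno>       // ERANGE
    #include <cstdio>
    #include <cstring>      // strerror
    #include <vector>

    int main(int argc, char **argv) {
        // "tester" is the group created by the earlier echo one-liner.
        const char *name = (argc > 1) ? argv[1] : "tester";

        // Deliberately modest fixed-size buffer; a ~40k-member group won't fit.
        std::vector<char> buf(128 * 1024);

        struct group grp;
        struct group *result = nullptr;
        int err = getgrnam_r(name, &grp, buf.data(), buf.size(), &result);

        if (result != nullptr)
            std::printf("found group %s (gid %u)\n", grp.gr_name, (unsigned) grp.gr_gid);
        else if (err == ERANGE)
            std::printf("ERANGE: %zu-byte buffer too small for group '%s'\n", buf.size(), name);
        else
            std::printf("lookup failed: %s\n", err ? std::strerror(err) : "no such group");

        return 0;
    }

Running that against the big group should print the ERANGE message, i.e. the same errno=34 (“Numerical result out of range”) that shows up in the Passenger log.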
I think the next step here is to find exactly where this is throwing an error. Both programs are open source, so we can see where to find this once we have the exact line that is erroring.
It appears to be in Passenger in:
src/cxx_supportlib/SystemTools/UserDatabase.cpp
Likely here:
OsUserOrGroup::OsUserOrGroup()
    // _SC_GETPW_R_SIZE_MAX is not a maximum:
    // Problems with large *nix groups and getgrgid_r / getgrnam_r?
    : buffer(std::max(1024 * 128, sysconf(_SC_GETPW_R_SIZE_MAX)))
{
    // Do nothing
}
Of course, the website referenced in that comment describes the problem, and I interpret the suggested solution as increasing the buffer size incrementally until we find the top end?
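For what it’s worth, here is a rough sketch of what I understand the usual fix pattern to be (this is my own assumption about the shape of a fix, not Passenger’s actual code, and the helper name lookupGroup is just illustrative): retry getgrgid_r with a progressively larger buffer whenever it reports ERANGE, rather than trusting any single fixed size.

    #include <grp.h>         // getgrgid_r, struct group
    #include <unistd.h>      // sysconf
    #include <sys/types.h>   // gid_t
    #include <cerrno>        // ERANGE
    #include <cstring>       // strerror
    #include <algorithm>     // std::max
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Look up a group by gid, growing the buffer on ERANGE instead of trusting
    // a single fixed size. The returned struct's pointers refer into 'buf', so
    // the caller must keep 'buf' alive for as long as the result is used.
    static struct group lookupGroup(gid_t gid, std::vector<char> &buf) {
        long hint = sysconf(_SC_GETGR_R_SIZE_MAX);            // often just a hint, may be -1
        buf.resize(std::max(128L * 1024L, hint > 0 ? hint : 0L));

        for (;;) {
            struct group grp;
            struct group *result = nullptr;
            int err = getgrgid_r(gid, &grp, buf.data(), buf.size(), &result);

            if (err == 0 && result != nullptr)
                return grp;                                    // success
            if (err == ERANGE) {                               // buffer too small: double and retry
                buf.resize(buf.size() * 2);
                continue;
            }
            if (err == 0)
                throw std::runtime_error("group " + std::to_string(gid) + " not found");
            throw std::runtime_error("getgrgid_r: " + std::string(std::strerror(err)));
        }
    }

Doubling keeps the number of retries small even for very large groups, though real code would probably also want a hard upper bound on the buffer size.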