Bad Gateway and PassengerAgent segfaults

Also, I realized I skipped answering the LD_PRELOAD question, so I added a dump of the environment to my PassengerAgent wrapper. Here are some relevant variables from the environment calling PassengerAgent on my Intel test system.

LIBRARY_PATH=/opt/rh/httpd24/root/usr/lib64
PERL5LIB=/opt/rh/rh-git29/root/usr/share/perl5/vendor_perl
X_SCLS=rh-git29 rh-nodejs6 rh-ruby24 httpd24 ondemand 
LD_LIBRARY_PATH=/opt/rh/httpd24/root/usr/lib64:/opt/rh/rh-nodejs6/root/usr/lib64:/opt/rh/rh-ruby24/root/usr/local/lib64:/opt/rh/rh-ruby24/root/usr/lib64:/opt/rh/httpd24/root/usr/lib64:/opt/ood/ondemand/root/usr/lib64
SUDO_USER=apache
PASSENGER_USE_FEEDBACK_FD=true
PATH=/opt/rh/rh-git29/root/usr/bin:/opt/rh/rh-nodejs6/root/usr/bin:/opt/rh/rh-ruby24/root/usr/local/bin:/opt/rh/rh-ruby24/root/usr/bin:/opt/rh/httpd24/root/usr/bin:/opt/rh/httpd24/root/usr/sbin:/opt/ood/ondemand/root/usr/bin:/opt/ood/ondemand/root/usr/sbin:/sbin:/bin:/usr/sbin:/usr/bin
ONDEMAND_VERSION=1.6.7
SUDO_COMMAND=/opt/ood/nginx_stage/sbin/nginx_stage pun -u griznog -a https%3a%2f%2f$[ondemand hostname was here]%3a443%2fnginx%2finit%3fredir%3d%24http_x_forwarded_escaped_uri
PYTHONPATH=/opt/rh/rh-nodejs6/root/usr/lib/python2.7/site-packages
OOD_DEV_APPS_ROOT=/home/griznog/ondemand/dev
RUBYLIB=/opt/ood/ondemand/root/usr/share/ruby/vendor_ruby:/opt/ood/ondemand/root/usr/lib64/ruby/vendor_ruby

No LD_PRELOAD is sneaking in and making it this far.

Good luck with testing this weekend. We just recalled Trey’s note about stateful paths, which may be worth reiterating: it could be a useful exercise to take stock of which directories are stateful. Maybe there’s one that shouldn’t be? Or, as Trey notes, one that should be.

I’ve double-checked to be sure, and this node stores no local state. Local disk is used for temp and scratch space only.

I tried passenger_disable_security_update_check off, but that didn’t change the behavior in any noticeable way.

Using an ondemand-passenger RPM built on an AMD node didn’t help; I still get the segfaults.

Booting the machine with a fresh 1.3 install works. Live upgrading that to 1.6 immediately results in segfaults and …

502 Bad Gateway
nginx/1.14.0

Just for fun, fresh logs of the segfaults:

Jul 27 10:28:10 ondemandhost kernel: [ 1539.888484] PassengerAgent[83789]: segfault at 0 ip 0000000000487ea3 sp 00007ffd14efcd00 error 4 in PassengerAgent[400000+361000]
Jul 27 10:28:10 ondemandhost kernel: PassengerAgent[83789]: segfault at 0 ip 0000000000487ea3 sp 00007ffd14efcd00 error 4 in PassengerAgent[400000+361000]
Jul 27 10:28:11 ondemandhost kernel: [ 1541.146344] PassengerAgent[84016]: segfault at 0 ip 0000000000487ea3 sp 00007ffe90142800 error 4 in PassengerAgent[400000+361000]
Jul 27 10:28:11 ondemandhost kernel: PassengerAgent[84016]: segfault at 0 ip 0000000000487ea3 sp 00007ffe90142800 error 4 in PassengerAgent[400000+361000]
Jul 27 10:28:12 ondemandhost kernel: [ 1542.416931] PassengerAgent[84243]: segfault at 0 ip 0000000000487ea3 sp 00007ffc7b899c40 error 4 in PassengerAgent[400000+361000]
Jul 27 10:28:12 ondemandhost kernel: PassengerAgent[84243]: segfault at 0 ip 0000000000487ea3 sp 00007ffc7b899c40 error 4 in PassengerAgent[400000+361000]

I tried to be clever and strace PassengerAgent from my shimmed-in wrapper, but I couldn’t figure out how to exec and strace at the same time (is that even possible?), so it didn’t work to fool the nginx process calling PassengerAgent.

At this point I have a working 1.6 test server and a broken server that works with 1.3 but not 1.6. They both boot the exact same image and build their configs in the same way, diverging only on basic config items like the server name, the SSL certs, and the ondemand apps installed.

Since I have to leave the production server in a working state, my next plan of attack will be to duplicate as much as possible of the apps and configs from the broken server onto the test server and hope I get lucky and narrow this down to some config or app problem.

Is there a reference for exactly which config items are in play when PassengerAgent is started/called? I still don’t have a really good grasp of the execution path taken from loading ondemand in the browser to getting the dashboard page.

griznog

Let me start with the TL;DR: /etc/security/limits.conf matters, and 1.6 uses more memory than 1.3 did for user processes.

Now, the breakdown.

I set up a (nearly) identical node w.r.t. hardware, using the identical ondemand and shibboleth config, and it worked fine. That was no help reproducing the error, but it eliminated any esoteric hardware differences as a potential source and gave me two “identical” systems to compare. Several passes over both turned up nothing significant that differed.

Going back to a previous effort, I figured out a way to wrap PassengerAgent such that I could capture an strace. First, I did:

cd /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries
mv PassengerAgent PassengerAgent.bin

Next, created PassengerAgent with this script:

#!/bin/bash
# Log how we were invoked, then hand off to the strace wrapper with arguments intact.
echo "$0 $*" | logger -t OOD
exec /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent.strace "$@"

Then created a second wrapper script, PassengerAgent.strace with this:

#!/bin/bash
# Run the real binary under strace, writing one trace file per user and PID.
strace -o /var/log/ondemand-debug/PassengerAgent.strace.${USER}.$$ \
  /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent.bin "$@"

and last,

mkdir /var/log/ondemand-debug
chmod 1777 /var/log/ondemand-debug
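
Before hitting the site, it’s worth confirming the shim is actually being invoked; a quick check along these lines should do it (assuming syslog writes to /var/log/messages, as on EL7):

# The wrapper's logger output shows up tagged OOD
grep OOD /var/log/messages | tail

# Each PassengerAgent invocation should leave a per-user, per-PID trace file
ls -l /var/log/ondemand-debug/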

I killed all the existing nginx processes and restarted httpd24 just to be safe, hit the ondemand site, and in the resulting straces I found:

mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x2ab79d09d190, FUTEX_WAKE_PRIVATE, 2147483647) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x2d0} ---
close(4)                                = 0
close(5)                                = 0
close(6)                                = 0
close(7)                                = 0
mkdir("/var/tmp/passenger-crash-log.1564333544.25ykyo", 0700) = 0
open("/var/tmp/passenger-crash-log.1564333544.25ykyo", O_RDONLY) = 4
pipe([5, 6])                            = 0
fork()                                  = 70360
close(5)                                = 0
dup2(6, 1)                              = 1
dup2(6, 2)                              = 2
write(2, "\n[ pid=70157, timestamp=15643335"..., 130) = 130
write(2, "[ pid=70157 ] Crash log files wi"..., 141) = 141
fork()                                  = 70361
tgkill(70157, 70157, SIGSTOP)           = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=70157, si_uid=325892} ---
--- stopped by SIGSTOP ---
tgkill(70157, 70157, SIGSEGV)           = 0
rt_sigreturn({mask=[]})                 = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=70157, si_uid=325892} ---
+++ killed by SIGSEGV +++

The segfault was masking the real problem: the memory limit in /etc/security/limits.conf. Killing something via that limit doesn’t trigger an OOM message of any kind. I removed the limit and things now work great, at least as far as testing has revealed today. Load should go up on this server tomorrow, and if nothing else breaks after a few days I’ll be content to write this off as self-inflicted via limits.conf.
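
For anyone chasing something similar: since a ulimit kill leaves no OOM message anywhere, the quickest way to make the limit visible is to dump it from the same shim, or read it off a running process. A rough sketch (the PID placeholder is whatever ps reports for the Passenger core):

# Added alongside the existing logger line in the PassengerAgent wrapper
ulimit -a | logger -t OOD

# Or check the limits an already-running process was started with
cat /proc/<passenger-core-pid>/limits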

What remains a mystery to me is why the identical test server didn’t hit my limits, which at 1GB must have been just close enough to the threshold that they only affected some users some of the time.

(Extra context: the 1GB limit is an artifact of having really low memory in our login nodes, and the ondemand server is treated as a login node in my configs. New login servers will remediate this within the next month or so.)

Thanks @efranz, @jeff.ohrstrom and @tdockendorf for your help. Although I have very little foot left at this point, somehow I still manage to shoot myself there with alarming frequency.

Yeah, that dumps the ulimits, but I didn’t notice anything strange.

I see that now; even in your earlier post you have the Passenger core at 10500964 kB (~10 GB) VSZ. I’m only running at about 1.4 GB.

To be clear, you were hitting this limit?
virtual memory (kbytes, -v) 10485760

I’m reading that max memory size (RLIMIT_RSS) doesn’t actually do anything in kernels 2.4 and above (even per man setrlimit), so virtual memory (RLIMIT_AS) seems to be the only ulimit I can hit.

But even so, it’s really good to know that a memory limit can trigger this whole mess. Especially if folks start having more limits in cgroups and so on.

I was setting these limits:

*         soft   memlock       8388608
*         hard   memlock      16777216
*         soft   as           10485760
*         hard   as           18874368
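
(Those limits.conf values are in KB, so a login shell under the soft limits above should report roughly the following; illustrative, not pasted from this box:)

$ ulimit -v    # "as" soft limit: 10485760 KB = 10 GB of address space
10485760
$ ulimit -l    # memlock soft limit: 8388608 KB = 8 GB locked
8388608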

So, adjust my previous comments about my limits by a factor of 10: I’m limiting address space to 10G and locked memory to 8G. Here’s how that plays out in practice:

numactl: mmap: Cannot allocate memory
[griznog ~]$ memhog 10G
numactl: mmap: Cannot allocate memory
[griznog ~]$ memhog 8G
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
[griznog ~]$ 

This seems to work per process, so I can start as many 8G processes as I want.

Looking at my OnDemand server this morning, it looks like an 8 or 10 GB limit would trigger the behavior we saw; at least for VSZ, there are Passenger core processes using as much as ~14 GB. RSS is much smaller. I’m assuming that even if Passenger isn’t really using that much memory, at some point when starting it tries to allocate it, which is why it was failing for me.
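
For anyone wanting to check their own server, something like this shows VSZ vs. RSS for the Passenger core processes (both columns are in KB; the bracketed grep pattern just keeps grep from matching itself):

ps axo pid,user,vsz,rss,args | grep '[P]assenger core'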

If a cgroup approach works the way I think it does, and a user slice limited total memory across all of a user’s processes, then the behavior might be even more random depending on how many connections the person had to the server. It seems that if limiting memory per user on the ondemand server is desired, it’d take some careful planning to do right.
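
(If anyone does go the cgroup route, the per-user slice is where such a limit would live. A quick way to see whether one is set and how the slices are behaving, with UID 1000 purely as an example:)

systemctl show user-1000.slice -p MemoryLimit   # per-slice memory cap, if any
systemd-cgtop                                   # live per-slice resource usage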

Thanks so much for the info, and apologies that you had to go through this mess without anyone thinking to check limits in the first place. Let us know if you run into more trouble with 1.6!