Let me start with the TL;DR: /etc/security/limits.conf matters, and OnDemand 1.6 uses more memory for user processes than 1.3 did.
Now, the breakdown.
I set up a node with (nearly) identical hardware, using the identical OnDemand and Shibboleth config, and it worked fine. That was no help reproducing the error, but it eliminated esoteric hardware differences as a potential source and gave me two “identical” systems to compare. Several passes over both turned up no significant differences.
Going back to a previous effort, I figured out a way to wrap PassengerAgent such that I could capture an strace. First, I did:
cd /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries
mv PassengerAgent PassengerAgent.bin
Next, created PassengerAgent with this script:
#!/bin/bash
# Log each invocation (command and arguments) to syslog, tagged OOD,
# then hand off to the strace wrapper; "$@" preserves argument boundaries.
echo "$0 $*" | logger -t OOD
exec /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent.strace "$@"
Then created a second wrapper script, PassengerAgent.strace, with this:
#!/bin/bash
# Trace the real binary; one file per invocation, named by user and PID.
strace -o /var/log/ondemand-debug/PassengerAgent.strace.${USER}.$$ \
  /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent.bin "$@"
and last, made both wrappers executable and created a world-writable (sticky) directory for the traces, since PassengerAgent runs as arbitrary users:

chmod 755 PassengerAgent PassengerAgent.strace
mkdir /var/log/ondemand-debug
chmod 1777 /var/log/ondemand-debug
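With that in place, it's easy to confirm the wrappers are actually being invoked (the syslog path is distro-dependent; /var/log/messages on CentOS):

grep OOD /var/log/messages        # each invocation logged by the first wrapper
ls -l /var/log/ondemand-debug/    # one trace file per invocation, named by user and PID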
Killed all the existing nginx processes and restarted httpd24 just to be safe; roughly the commands below, assuming the standard SCL Apache service name under systemd (adjust for your install):
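pkill -f nginx                     # stop the per-user nginx (PUN) processes
systemctl restart httpd24-httpd    # restart the SCL Apache that fronts OnDemand

Then I hit the OnDemand site, and in the resulting straces I found: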
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x2ab79d09d190, FUTEX_WAKE_PRIVATE, 2147483647) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x2d0} ---
close(4) = 0
close(5) = 0
close(6) = 0
close(7) = 0
mkdir("/var/tmp/passenger-crash-log.1564333544.25ykyo", 0700) = 0
open("/var/tmp/passenger-crash-log.1564333544.25ykyo", O_RDONLY) = 4
pipe([5, 6]) = 0
fork() = 70360
close(5) = 0
dup2(6, 1) = 1
dup2(6, 2) = 2
write(2, "\n[ pid=70157, timestamp=15643335"..., 130) = 130
write(2, "[ pid=70157 ] Crash log files wi"..., 141) = 141
fork() = 70361
tgkill(70157, 70157, SIGSTOP) = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=70157, si_uid=325892} ---
--- stopped by SIGSTOP ---
tgkill(70157, 70157, SIGSEGV) = 0
rt_sigreturn({mask=[]}) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=70157, si_uid=325892} ---
+++ killed by SIGSEGV +++
The segfault was masking the real problem: the memory limit in /etc/security/limits.conf. A process killed via this limit doesn't trigger an OOM message of any kind. I removed the limit and things now work great, at least as far as today's testing has revealed. Load should go up on this server tomorrow, and if nothing else breaks after a few days I'll be content to write this off as self-inflicted via limits.conf.
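This failure mode is easy to reproduce in isolation (a sketch for illustration, not part of the original debugging): cap the address space in a throwaway subshell and watch a large allocation fail with ENOMEM while the kernel log stays silent.

# ulimit -v takes KB, so this caps virtual address space at 1 GB;
# GNU dd then fails to malloc its 2 GB input buffer ("memory exhausted"),
# and dmesg shows nothing, because this is an rlimit, not the OOM killer.
( ulimit -v 1048576
  dd if=/dev/zero of=/dev/null bs=2G count=1 )

On a live system, /proc/<pid>/limits shows the cap a running process actually inherited, e.g. (the pgrep pattern here is illustrative):

grep 'Max address space' /proc/$(pgrep -f PassengerAgent | head -n1)/limits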
What remains a mystery to me is why the identical test server didn't hit my limits; at 1 GB they must have been just close enough to the threshold that it only affected some users some of the time.
(Extra context: the 1 GB limit is an artifact of really low memory in our login nodes, and the OnDemand server is treated as a login node in my configs. New login servers will remediate this within the next month or so.)
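For reference, a cap of that size in limits.conf would look something like the following (a hypothetical reconstruction; the actual domain and item in my file may have differed; "as" is the address-space limit, in KB):

# /etc/security/limits.conf: hypothetical 1 GB address-space cap for all users
*    hard    as    1048576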
Thanks @efranz, @jeff.ohrstrom, and @tdockendorf for your help. Although I have very little foot left at this point, somehow I still manage to shoot myself there with alarming frequency.