Bad Gateway and PassengerAgent segfaults

Hi all,

I upgraded to 1.6 (from 1.3) about a week ago and everything seemed to work fine until today. Since this morning, random users (including me at times) have been getting

502 Bad Gateway
nginx/1.14.0

and on my OnDemand server I am seeing a lot of this in dmesg:

[ 7889.698774] PassengerAgent[31555]: segfault at 0 ip 0000000000487ea3 sp 00007ffc0ac54a00 error 4 in PassengerAgent[400000+361000]
[ 7889.971281] PassengerAgent[31810]: segfault at 0 ip 0000000000487ea3 sp 00007ffe14c47000 error 4 in PassengerAgent[400000+361000]
[ 7890.452068] PassengerAgent[32089]: segfault at 0 ip 0000000000487ea3 sp 00007fff96027740 error 4 in PassengerAgent[400000+361000]
[ 7890.737343] PassengerAgent[32310]: segfault at 0 ip 0000000000487ea3 sp 00007fffcf2418c0 error 4 in PassengerAgent[400000+361000]
[ 7891.233598] PassengerAgent[32537]: segfault at 0 ip 0000000000487ea3 sp 00007ffc84a530c0 error 4 in PassengerAgent[400000+361000]
[ 7891.567982] PassengerAgent[32760]: segfault at 0 ip 0000000000487ea3 sp 00007fffadbb6600 error 4 in PassengerAgent[400000+361000]
[ 7891.997038] PassengerAgent[32983]: segfault at 0 ip 0000000000487ea3 sp 00007ffd19ac7700 error 4 in PassengerAgent[400000+361000]

Hoping this is enough of a clue that someone recognizes the problem, or can point me to where I should look first for what has gone off the rails.

Thanks,

griznog

After some more digging I’m still pretty clueless as to why this suddenly started. For affected users, Passenger is segfaulting hundreds of times per minute, to the point that the crash dumps filled up /tmp, which triggered a different error for the users who weren’t affected by the segfaults.

I don’t know whether the PassengerAgent segfaults are a red herring or a real clue to the problem.

In version 1.6 it seems we’re using Passenger 5.3.7. Somewhere between 5.0.x and 5.3.7, Passenger started reporting its processes as just “Passenger” instead of “PassengerAgent” (which is what you see there).

When I do a ps -elf on a 1.6 OOD installation I see only ‘Passenger watchdog’ and ‘Passenger core’ processes. I’m guessing you may be seeing ‘PassengerAgent xyz’.

Can you check your uptime to confirm those dmesg messages were indeed from today (awk '{print $1}' /proc/uptime gives seconds since boot)?
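
A quick sketch of that check, using the 7889-second timestamp from the first dmesg line above (assumes GNU date, as on any EL7 box):

```shell
# Derive the boot time from /proc/uptime, then convert a dmesg timestamp
# (seconds since boot) to wall-clock time.
dmesg_ts=7889                                   # from "[ 7889.698774] PassengerAgent..."
uptime_s=$(awk '{print int($1)}' /proc/uptime)  # seconds since boot
boot_epoch=$(( $(date +%s) - uptime_s ))
date -d "@$(( boot_epoch + dmesg_ts ))"         # when that segfault actually happened
```

If the printed time is from today, the segfaults are current and not stale ring-buffer entries.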

If 1.6 OOD is booting 1.3 Passenger binaries, then that’s likely the cause. I’d look at what your /opt/rh/ondemand/root/sbin/ and /opt/rh/ondemand/root/bin directories contain and whether they were updated.

OK, it seems the segfaults and dmesg messages are real and legit.

Clearly you’re loading old libraries from 1.3 nginx, or Passenger, or both. @efranz will have more for you, because I’m not sure which directories/files you’ll have to shift around or remove.

@griznog please see the directions at https://osc.github.io/ood-documentation/release-1.6/release-notes/v1.5-release-notes.html#upgrading-from-v1-3. Those steps are relevant when upgrading to 1.6 from 1.3.

In 1.5 we upgraded to Passenger 5 because the Passenger 4 SCL was end of life. Unfortunately, Software Collections dropped support entirely for Passenger 5. Since NGINX has to be compiled with the Passenger plugin code available, we had to build our own SCL compatible packages that included Passenger 5 and NGINX. So directories like /var/lib/nginx change to /var/lib/ondemand-nginx.
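
Given that rename, a quick sanity check is to confirm the new directory exists and flag any leftover old one (only /var/lib/ondemand-nginx and /var/lib/nginx are from the notes above; adapt the list to your layout):

```shell
# Report which of the old and new nginx state directories are present.
# A healthy 1.5+/1.6 install should have the ondemand-nginx path.
for d in /var/lib/ondemand-nginx /var/lib/nginx; do
  if [ -d "$d" ]; then echo "present: $d"; else echo "absent:  $d"; fi
done
```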

Follow those upgrade from 1.3 to 1.5 instructions, except use the 1.6 rpm location.

I should put a warning on the 1.6 upgrade directions to “follow the 1.5 upgrade directions first if upgrading from 1.3”, or provide upgrade directions from 1.3 to 1.6 in the release notes. Sorry about that.

Of course if you did all that and you are still experiencing these problems, then let me know and we’ll figure it out.

My nodes are diskless/stateless, so when I say I “upgraded” OnDemand, what I really mean is that the node rebooted and did a complete fresh install (which is scripted) onto a base OS image with no history of OnDemand-related packages on it. That install should have pointed at the ondemand 1.6 repo mirror I have; I looked it all over and don’t see how it could have picked up older RPMs or a different mirror. When I rebooted with OnDemand 1.3, everything worked as expected. I updated my repo mirrors of ondemand-1.6 and SCL just to be safe, but there was nothing to update for ondemand-1.6.

After verifying that it has to install from the correct repo, I rebooted, triggering a fresh install of 1.6. The yum logs for what was installed are here:

After that install, I immediately get the

502 Bad Gateway
nginx/1.14.0

and in my logs:

--------------------------------------
[ pid=76054 ] **************** LOOK HERE FOR CRASH DETAILS *****************

[ pid=76054 ] Crash log dumped to this directory:
[ pid=76054 ] /var/tmp/passenger-crash-log.1563635753.jfhgfE

[ pid=76054 ] **************** LOOK ABOVE FOR CRASH DETAILS ****************
[ W 2019-07-20 08:15:54.6008 54366/T4 age/Wat/AgentWatcher.cpp:97 ]: Passenger core (pid=76054) crashed with signal SIGSEGV, restarting it...
[ N 2019-07-20 08:15:54.6297 76282/T1 age/Cor/CoreMain.cpp:1310 ]: Starting Passenger core...
[ N 2019-07-20 08:15:54.6299 76282/T1 age/Cor/CoreMain.cpp:256 ]: Passenger core running in multi-application mode.
[ N 2019-07-20 08:15:54.9182 76282/T1 age/Cor/CoreMain.cpp:985 ]: Passenger core online, PID 76282

[ pid=76282, timestamp=1563635754 ] Process aborted! signo=SIGSEGV(11), reason=SEGV_MAPERR, si_addr=0x2d0, randomSeed=1563635754
[ pid=76282 ] Crash log files will be dumped to /var/tmp/passenger-crash-log.1563635754.QeECB2 <--- ******* LOOK HERE FOR DETAILS!!! *******
[ pid=76282 ] Date and uname:
Sat Jul 20 08:15:55 PDT 2019
Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64
[ pid=76282 ] Phusion Passenger version: 5.3.7
[ pid=76282 ] libc backtrace available!
--------------------------------------
[ pid=76282 ] Backtrace with 13 frames:
ERROR: cannot execute '/opt/ood/nginx_stage/bin/ruby "/opt/ood/ondemand/root/usr/share/passenger/helper-scripts/backtrace-sanitizer.rb"' for sanitizing the backtrace, writing to stderr directly...
Passenger core[0x487c8d]
Passenger core[0x4882c2]
Passenger core[0x4896e1]
/lib64/libpthread.so.0(+0xf5d0)[0x2ba12d62f5d0]
/lib64/libpthread.so.0(pthread_detach+0x0)[0x2ba12d628e30]
Passenger core(_ZN5boost6thread6detachEv+0x69)[0x656309]
Passenger core[0x440e05]
Passenger core[0x4f3721]
Passenger core[0x500629]
Passenger core[0x501bed]
Passenger core[0x486297]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ba12ed7c3d5]
Passenger core[0x432ec1]
--------------------------------------
[ pid=76282 ] Dumping ulimits...
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/ulimits.log
--------------------------------------
[ pid=76282 ] Dumping OXT backtraces...
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/backtrace_oxt.log
--------------------------------------
[ pid=76282 ] Dumping controller states...
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/controller_states.log
--------------------------------------
[ pid=76282 ] Dumping controller configs...
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/controller_configs.log
--------------------------------------
[ pid=76282 ] Dumping pool state...
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/pool.log
--------------------------------------
[ pid=76282 ] Dumping mbuf statistics...
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/mbufs.log
--------------------------------------
[ pid=76282 ] Open files and file descriptors:
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/fds.log
--------------------------------------
[ pid=76282 ] Dumping a more detailed backtrace with crash-watch...
Dumping to /var/tmp/passenger-crash-log.1563635754.QeECB2/backtrace.log

and in dmesg, lots of:

[ 1427.403090] PassengerAgent[17639]: segfault at 0 ip 0000000000487ea3 sp 00007fff5812e140 error 4 in PassengerAgent[400000+361000]
[ 1428.619243] PassengerAgent[17866]: segfault at 0 ip 0000000000487ea3 sp 00007ffe38d09740 error 4 in PassengerAgent[400000+361000]
[ 1429.849809] PassengerAgent[18093]: segfault at 0 ip 0000000000487ea3 sp 00007fffdbfec380 error 4 in PassengerAgent[400000+361000]
[ 1431.059564] PassengerAgent[18320]: segfault at 0 ip 0000000000487ea3 sp 00007ffc735b9440 error 4 in PassengerAgent[400000+361000]
[ 1432.287969] PassengerAgent[18547]: segfault at 0 ip 0000000000487ea3 sp 00007fff7d6ea4c0 error 4 in PassengerAgent[400000+361000]
[ 1433.488811] PassengerAgent[18774]: segfault at 0 ip 0000000000487ea3 sp 00007ffee3f765c0 error 4 in PassengerAgent[400000+361000]
[ 1434.735127] PassengerAgent[19003]: segfault at 0 ip 0000000000487ea3 sp 00007fff76c7ac80 error 4 in PassengerAgent[400000+361000]
[ 1435.976438] PassengerAgent[19233]: segfault at 0 ip 0000000000487ea3 sp 00007ffd1281a080 error 4 in PassengerAgent[400000+361000]

PassengerAgent is being started from somewhere, lots of times, and is immediately becoming a zombie:

# ps aux | grep Pass
griznog   30002  0.0  0.0 10483944 19528 ?      tl   08:23   0:00 Passenger core
griznog   30201  0.0  0.0      0     0 ?        Z    08:23   0:00 [PassengerAgent] <defunct>
griznog   30202  0.0  0.0 10483944 16848 ?      S    08:23   0:00 Passenger core
root      30227  0.0  0.0 112712   976 pts/0    R+   08:23   0:00 grep --color=auto Pass
griznog   54366  1.5  0.0 386428 10120 ?        Ssl  08:14   0:08 Passenger watchdog

Double checking the installed packages:

# rpm -qa | grep passenger
ondemand-passenger-5.3.7-2.el7.x86_64

/opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent exists:

# rpm -qif /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent
Name        : ondemand-passenger
Version     : 5.3.7
Release     : 2.el7
Architecture: x86_64
Install Date: Sat 20 Jul 2019 08:13:02 AM PDT
Group       : System Environment/Daemons
Size        : 7249935
License     : Boost and BSD and BSD with advertising and MIT and zlib
Signature   : RSA/SHA1, Tue 26 Feb 2019 11:37:45 AM PST, Key ID 4b72fe2b92d31755
Source RPM  : ondemand-passenger-5.3.7-2.el7.src.rpm
Build Date  : Tue 26 Feb 2019 11:26:02 AM PST
Build Host  : 0901e7b904bd
Relocations : (not relocatable)
Packager    : Phusion
Vendor      : Phusion
URL         : https://www.phusionpassenger.com
Summary     : Phusion Passenger application server
Description :
Phusion Passenger® is a web server and application server, designed to be fast,
robust and lightweight. It takes a lot of complexity out of deploying web apps,
adds powerful enterprise-grade features that are useful in production,
and makes administration much easier and less complex. It supports Ruby,
Python, Node.js and Meteor.

And PassengerAgent does show up in a few places:

# grep -r -i PassengerAgent /opt/ood/
/opt/ood/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/constants.rb:    AGENT_EXE                 = "PassengerAgent"
/opt/ood/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/admin_tools/memory_stats.rb:          /((^| )Passenger|(^| )Rails:|(^| )Rack:|wsgi-loader.py|(.*)PassengerAgent|rack-loader.rb)/)
Binary file /opt/ood/ondemand/root/usr/sbin/nginx matches
Binary file /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/nginx-1.14.0 matches

Thanks,

griznog

Also, on my other OnDemand server, which uses the same base OS image and setup scripts to install fresh on reboot, and which is not crashing or having any issues:

# ps aux | grep Pass
griznog  136498  0.0  0.0 386428  8016 ?        Ssl  08:42   0:00 Passenger watchdog
griznog  136501  4.5  0.0 10500964 22304 ?      Sl   08:42   0:06 Passenger core
root     138783  0.0  0.0 112716   980 pts/1    S+   08:44   0:00 grep --color=auto Pass
root     269688  0.0  0.0 217324  2016 ?        Ssl  Jul14   0:00 PassengerWatchdog
root     269691  0.0  0.0 507748  4800 ?        Sl   Jul14   5:46 PassengerHelperAgent
nobody   269698  0.0  0.0 233024  4364 ?        Sl   Jul14   0:03 PassengerLoggingAgent

Packages are the same.

WAG: the working server is Intel and the broken one is AMD. Is it possible the RPMs were built on an Intel machine and the compiler optimized for that CPU, so that the result segfaults on AMD?
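
One way to put some data behind that guess without risking the production box: dump the instruction-set flags the kernel reports on each host and diff the two files (a sketch; assumes x86 Linux on both machines):

```shell
# Dump this host's CPU flags, one per line, sorted. Run on both the Intel
# and AMD servers, then diff the resulting files to see which extensions
# only one of them supports.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sed '1d' | sort > /tmp/cpuflags.$(uname -n)
wc -l /tmp/cpuflags.$(uname -n)
```

Any flag present on the build/working host but missing on the broken one is a candidate for an illegal-instruction style failure.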

griznog

@griznog Is it possible there are differences between the two servers besides AMD vs. Intel CPUs? I know I’ve had issues in the past with stateless setups where I add a new application to the system, forget to add a path to be stateful, and the program segfaults because it doesn’t know how to handle being unable to write out certain files. In our case this is usually with /etc/rwtab and /etc/statetab on RHEL systems.

If the stateless environment is identical, then I think the next thing to do is dig into how Passenger is being built. We use the upstream Passenger RPM build workflow, modified to put Passenger into an OnDemand-specific SCL path. The actual build process is buried somewhere in the Passenger repo, but it’s something I can likely dig out of the build logs.

Can you attach some/all of those crash logs?

It seems this is the line you’re segfaulting at: constructing a string? But I can’t tell what’s calling it, which is why the stack trace in the crash logs would be helpful.

[root@4cd7ca4b7c5c ~]# addr2line -e /opt/rh/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent -fCi 487ea3
char* std::string::_S_construct<char*>(char*, char*, std::allocator<char> const&, std::forward_iterator_tag)
??:?
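
As a side note on why that addr2line invocation works: the dmesg lines report the mapping as PassengerAgent[400000+361000], i.e. the binary is loaded at 0x400000 (a non-PIE executable at its link-time base), so the faulting ip can be passed to addr2line unchanged. For a PIE binary you would subtract the mapping base shown in brackets first; a sketch of that arithmetic:

```shell
# Faulting ip and mapping base, both taken from the dmesg segfault line.
# For non-PIE (base 0x400000) the ip works directly; for PIE, feed
# addr2line the offset computed here instead.
ip=0x487ea3; base=0x400000
printf 'offset into mapping: 0x%x\n' $(( ip - base ))  # prints: offset into mapping: 0x87ea3
```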

I still don’t see any of those PassengerAgent processes, defunct or not.

[root@4cd7ca4b7c5c ~]# ps aux | grep Pass
jeff       101  0.0  0.0 365792  4368 ?        Ssl  Jul19   0:01 Passenger watchdog
jeff       104  0.0  0.0 1424124 6212 ?        Sl   Jul19   0:17 Passenger core
root      1014  0.0  0.0  10700  1664 pts/0    S+   14:48   0:00 grep --color=auto Pass

@tdockendorf Both systems boot from the same image, then each gets its individual configuration applied, so they diverge with respect to OnDemand only in the config files, SSL certs, and apps that have been added. If it’s an issue with the apps, that could be triggering it, but the problem doesn’t always happen and doesn’t happen to all users. I deployed the apps to my user sandbox on the Intel server with OnDemand 1.6 and haven’t seen any issues from it (yet). A quick grep through all the deployed apps doesn’t turn up PassengerAgent anywhere either. The gorilla in the room is the CPU, which complicates testing because the Intel box is my dev/testing area and the AMD box is production, and I may be only a few testing/debugging reboots away from a crowd of users with pitchforks and torches.

I’ve placed an example crash here temporarily:

http://public.scg.stanford.edu/griznog/passenger-crash-log.1563568452.eynEiv.tar.gz

The only place in there where I see PassengerAgent is:

# cat fds.log 
COMMAND     PID  USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
Passenger 91834 edsun  cwd       DIR               0,19      860     26767 /
Passenger 91834 edsun  rtd       DIR               0,19      860     26767 /
Passenger 91834 edsun  txt       REG               0,19  3643392    810110 /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent
Passenger 91834 edsun  mem       REG               0,19   163400     56900 /usr/lib64/ld-2.17.so
Passenger 91834 edsun  mem       REG               0,19    43776     59116 /usr/lib64/librt-2.17.so
Passenger 91834 edsun  mem       REG               0,19   141968     58824 /usr/lib64/libpthread-2.17.so
Passenger 91834 edsun  mem       REG               0,19   115848     61923 /usr/lib64/libnsl-2.17.so
...

I don’t speak/read Passenger crash dumps, though…

@griznog The host where I build the RPMs is Intel Haswell, and the RPMs are built inside a docker container that uses mock. I re-ran the build and all instances of -mtune are using -mtune=generic.

Are the AMD systems very new AMD CPUs? The RPM build process appears to use mostly Red Hat compiler defaults, and it’s possible those defaults could be incompatible with some very new CPUs.

The node has 2 x AMD EPYC 7401 CPUs that are a year old. -mtune=generic is how I fix this kind of problem for things I build locally, so that should eliminate it as a possible cause. Also, CPU-instruction-related crashes usually say so explicitly (SIGILL rather than SIGSEGV), and error 4 here decodes to a user-mode read of an unmapped page rather than an instruction fetch, so maybe not. I did a crude/naive objdump and some grepping and didn’t see anything in PassengerAgent that looks too fancy to run on the EPYC CPUs, but I should really stress the “naive” in that attempt.
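
For what it’s worth, a slightly more systematic version of that naive objdump pass is to histogram the mnemonics and eyeball anything AVX/FMA-flavoured. This is a sketch shown against /bin/ls as a stand-in; point BIN at the actual PassengerAgent binary:

```shell
# Histogram the instruction mnemonics in a binary. objdump disassembly
# lines look like "addr:<TAB>bytes<TAB>mnemonic operands", so take the
# third tab-separated field and keep its first word.
BIN=/bin/ls   # placeholder; substitute the PassengerAgent path
objdump -d "$BIN" \
  | awk -F'\t' 'NF>=3 {split($3, a, " "); print a[1]}' \
  | sort | uniq -c | sort -rn | head -20
```

Anything starting with v (vmovaps, vfmadd…, vpbroadcast…) in the output would suggest AVX-era code paths worth scrutinizing, though EPYC supports AVX2, so this mostly rules things out.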

Is it possible to rebuild these with a straightforward rpmbuild ...? If so I can rebuild them on the AMD box and check for differences.

Looking at the backtrace_oxt.log, you don’t seem to be doing much, just polling and waiting around for work.

My only next guess is https://github.com/phusion/passenger/issues/2089, because the only threads doing any work are the security updater, the GCer, analytics, and the main function. Do you have outbound connectivity from your production servers, specifically to https://securitycheck.phusionpassenger.com:443? Though I’m not able to replicate a firewall issue (timeouts); it doesn’t seem to have a problem with connections being refused outright.

If not, you can try disabling the security update check by adding
passenger_disable_security_update_check on;
to the file
/opt/ood/ondemand/root/etc/nginx/conf.d/passenger.conf

Thanks for the pointer. There are no outbound firewalls in play and I can wget that URL without any problems, but on my next 1.6 attempt I will disable this check and see what happens.

@griznog Unfortunately, the RPM build workflow we adopted from Passenger requires docker. You may be able to rebuild the SRPM with rpmbuild, but it would require your system to have the necessary SCLs installed.

With docker:

git clone https://github.com/OSC/ondemand-packaging.git
cd ondemand-packaging
./build.sh -w /tmp/work -o /tmp/output -t build:passenger_nginx -d el7 $(pwd)/build/passenger-nginx.py

The RPMs will be in /tmp/output/el7.

Without docker (tested inside a docker container):

yum install libcurl-devel ondemand-runtime ondemand-build ondemand-scldevel ondemand-ruby \
  ondemand-apache rh-ror50-rubygem-rack zlib-devel pcre-devel openssl-devel httpd24-build
yum install @buildsys-build
mkdir -p ~/rpmbuild/SOURCES
cd ~/rpmbuild/SOURCES
wget -O ondemand-passenger-5.3.7-2.el7.src.rpm https://yum.osc.edu/ondemand/1.6/web/el7/SRPMS/ondemand-passenger-5.3.7-2.el7.src.rpm
rpm2cpio ondemand-passenger-5.3.7-2.el7.src.rpm | cpio -dimv
cd ~
rpmbuild -bb --define 'scl ondemand' ~/rpmbuild/SOURCES/passenger.spec

I rebuilt locally on an AMD system; the resulting binaries are different, but the instruction set of each is the same. I’m going to make a test attempt with my build if nothing else turns up as a likely source of the problem.

@griznog Did either the rebuild or disabling the update check work? I’ve been searching the Passenger GitHub for other likely sources but couldn’t find any. Everything you’ve listed here indicates you’re running the same Passenger version (5.3.7) we run everywhere else, so we’ll have to find where else your system diverges.

The stacks are in string allocations. Do you have any LD_PRELOAD, by some off chance? These base images, are they fairly ‘base’? I mean, do they have other versions of httpd or nginx, things like that? Can you share your install scripts/procedures?

I won’t get a chance to test the AMD-built RPMs and the update-check disabling until the weekend.

Sharing my home-grown config scripts is potentially embarrassing, but I can probably handle the blow to my self-esteem of showing the OnDemand-specific bits in pseudocode. The basic flow is:

Warewulf is used to provision/boot a stateless VNFS. This VNFS is pretty heavy; between the VNFS and the config process, by the time it gets to the OnDemand step it has a fully installed CentOS with GNOME, KDE, and Xfce and a wide assortment of tool and devel packages relevant to a bioinformatics-oriented cluster.

The OnDemand install does, basically (bash-like pseudocode):

# install if not here already 
rpm -q ondemand || yum -y install --nogpgcheck ${ondemand_repos} ondemand

# get all the config files we have modified from the defaults
for file in $list_of_config_files; do
  wget -O $file http://myconfighost/configs/$file
done

# get our local apps.
pushd /var/www/ood/apps/sys
for app in ${list_of_apps}; do
  svn co https://subversionserver/repos/apps/${app}
done
popd

# (re)start httpd24
/opt/ood/ood-portal-generator/sbin/update_ood_portal
systemctl enable httpd24-httpd.service httpd24-htcacheclean.service
systemctl stop httpd24-httpd.service httpd24-htcacheclean.service
sleep 4
systemctl start httpd24-httpd.service httpd24-htcacheclean.service

# This was configured earlier, restart here just to be safe.
systemctl restart shibd.service

On my test/dev OnDemand host, the currently running config is the result of booting with 1.3 and then live-upgrading to 1.6, so it does have some old packages lying around, e.g.:

# rpm -qa | grep -i passenger | sort
ondemand-passenger-5.3.7-2.el7.x86_64
rh-passenger40-2.0-8.el7.x86_64
rh-passenger40-libeio-4.19-3.el7.x86_64
rh-passenger40-libev-4.15-6.el7.x86_64
rh-passenger40-mod_passenger-4.0.50-9.el7.x86_64
rh-passenger40-passenger-4.0.50-9.el7.x86_64
rh-passenger40-rubygem-daemon_controller-1.2.0-2.el7.noarch
rh-passenger40-runtime-2.0-8.el7.x86_64

But it does not suffer from the problems affecting the other OnDemand host. This system only sees my testing, though, so if the problem is triggered by load I’d never see it here. Looking at my logs from a clean boot with a 1.6 install, there are no other Passenger versions getting pulled in, just the ondemand-passenger-5.3.7-2.el7.x86_64 package.

It occurred to me that perhaps having the old version of Passenger there was somehow making it work, so I removed it from the test server, but I still can’t reproduce the problem there.

Continuing my random walk, I replaced the upstream version on my Intel test server with my locally compiled-on-AMD version (just the ondemand-passenger RPM), and everything still worked on the Intel test server.

I replaced /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent with a shell script that dumps whatever args it was called with to logger, then execs the real binary with those args:

#!/bin/bash
# Log the invocation, then hand off to the renamed real binary.
# Quote "$@" so arguments survive word splitting intact.
echo "$0 $@" | logger -t OOD
exec /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent.bin "$@"

Which results in this series of logged calls to PassengerAgent:

Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent watchdog
Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent core
Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent spawn-env-setupper /tmp/passenger.spawn.XXXXRlV65A --before
Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent spawn-env-setupper /tmp/passenger.spawn.XXXXRlV65A --after

Hopefully something in there is a clue? I’m going to do some more testing on the problem server this weekend: a clean install of 1.6, and then a similar random walk trying the update-check disable and the AMD-compiled RPM. If none of that works, I’ll try to capture how PassengerAgent is called and maybe at least narrow the problem down to one of those invocations.

~griznog