I won’t get a chance to test the AMD-built rpms and update disabling until the weekend.
Sharing my homegrown config scripts is potentially embarrassing, but I can probably handle the self-esteem blow of showing the ondemand-specific bits in pseudocode. The basic flow is:
Warewulf is used to provision/boot a stateless VNFS. This VNFS is pretty heavy. Between the VNFS and the config process, by the time it gets to the ondemand step the node has a fully installed CentOS with Gnome, KDE and XFCE and a wide assortment of tool and devel packages relevant to a bioinformatics-oriented cluster.
The ondemand install does, basically (bash-like pseudo code):
# install if not here already
rpm -q ondemand || yum -y install --nogpgcheck ${ondemand_repos} ondemand
# get all the config files we have modified from the defaults
for file in $list_of_config_files; do
    wget -O "$file" "http://myconfighost/configs/$file"
done
# get our local apps.
pushd /var/www/ood/apps/sys
for app in ${list_of_apps}; do
    svn co "https://subversionserver/repos/apps/${app}"
done
popd
# regenerate the portal config, then (re)start httpd24
/opt/ood/ood-portal-generator/sbin/update_ood_portal
systemctl enable httpd24-httpd.service httpd24-htcacheclean.service
systemctl stop httpd24-httpd.service httpd24-htcacheclean.service
sleep 4
systemctl start httpd24-httpd.service httpd24-htcacheclean.service
# This was configured earlier, restart here just to be safe.
systemctl restart shibd.service
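A failed download in the wget loop above would leave a truncated or empty config file behind, since `wget -O` opens the target before it fetches anything. Here is a minimal sketch of one way around that. The helper name `fetch_config` is hypothetical (not part of OnDemand), and the host is the same placeholder used above:

```shell
# Hypothetical helper: download to a temp file first and only overwrite the
# real config if the download succeeded.
fetch_config() {
    local base_url=$1 file=$2 tmp
    # Temp file sits next to the target so it is on the same filesystem.
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if wget -q -O "$tmp" "${base_url}/${file}"; then
        # Write through the existing file so its owner and mode are kept.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        echo "fetch failed, leaving ${file} untouched" >&2
        return 1
    fi
}
```

In the loop this would be called as `fetch_config http://myconfighost/configs "$file" || exit 1`, so the install stops before restarting httpd24 with a half-written config.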
On my test/dev ondemand host, the currently running config is the result of booting with 1.3 and then live-upgrading to 1.6, so it does have some old packages lying around, e.g.:
# rpm -qa | grep -i passenger | sort
ondemand-passenger-5.3.7-2.el7.x86_64
rh-passenger40-2.0-8.el7.x86_64
rh-passenger40-libeio-4.19-3.el7.x86_64
rh-passenger40-libev-4.15-6.el7.x86_64
rh-passenger40-mod_passenger-4.0.50-9.el7.x86_64
rh-passenger40-passenger-4.0.50-9.el7.x86_64
rh-passenger40-rubygem-daemon_controller-1.2.0-2.el7.noarch
rh-passenger40-runtime-2.0-8.el7.x86_64
But it does not suffer from the problems affecting the other ondemand host. This system only sees my testing, though, so if the problem is triggered by load I’d never see it here. Looking at my logs from a clean boot with a 1.6 install, there are no other passenger versions getting pulled in, just the ondemand-passenger-5.3.7-2.el7.x86_64 package.
It occurred to me that perhaps having the old version of passenger there was somehow making it work, so I removed it from the test server but still can’t reproduce the problem there.
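For anyone wanting to check their own host for the same leftovers, a small filter over the `rpm -qa` output is enough. This is only a sketch: the function name `list_stale_passenger` is made up, and it assumes the only passenger package you want to keep is the one whose name starts with `ondemand-passenger-`:

```shell
# Hypothetical filter: reads a package list (e.g. from `rpm -qa`) on stdin
# and prints every passenger package except the ondemand-passenger one.
list_stale_passenger() {
    grep -i passenger | grep -v '^ondemand-passenger-' | sort
}
```

Usage would be `rpm -qa | list_stale_passenger`. After looking over that list by hand, the names can be passed to `yum remove`.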
Continuing my random walk, I replaced the upstream version on my Intel test server with my locally compiled-on-AMD-CPU version (just the ondemand-passenger rpm) and everything still worked on the Intel test server.
I replaced /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent with a shell script that dumps whatever args it is called with to logger and then execs the real binary (renamed to PassengerAgent.bin) with the same args, e.g.,
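#!/bin/bash
# log the wrapper path and all args to syslog under the OOD tag
echo "$0 $*" | logger -t OOD
# hand off to the real binary; "$@" keeps each arg intact, even with spaces
exec /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent.bin "$@"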
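Putting the wrapper in place can itself be scripted. The following is a rough sketch, not something shipped with OnDemand or Passenger: the function name `install_agent_wrapper` is hypothetical, and it assumes the real binary is moved aside to a `.bin` suffix as in the script above:

```shell
# Hypothetical installer: move the real binary to <path>.bin and drop a
# logging wrapper at the original path.
install_agent_wrapper() {
    local agent=$1
    # Refuse to run twice, otherwise the wrapper would overwrite the real binary.
    if [ -e "${agent}.bin" ]; then
        echo "${agent}.bin already exists, not wrapping again" >&2
        return 1
    fi
    mv "$agent" "${agent}.bin" || return 1
    # Escaped \$ is written literally; ${agent} is expanded now, at install time.
    cat > "$agent" <<EOF
#!/bin/bash
echo "\$0 \$*" | logger -t OOD
exec "${agent}.bin" "\$@"
EOF
    chmod 755 "$agent"
}
```

Undoing it is a single `mv` of the `.bin` file back over the wrapper.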
This results in the following series of logged calls to PassengerAgent:
Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent watchdog
Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent core
Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent spawn-env-setupper /tmp/passenger.spawn.XXXXRlV65A --before
Jul 24 11:20:56 ondemandhost OOD: /opt/ood/ondemand/root/usr/lib64/passenger/support-binaries/PassengerAgent spawn-env-setupper /tmp/passenger.spawn.XXXXRlV65A --after
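Once there are more than a handful of these lines, a quick tally per subcommand makes it easier to compare a working host against the broken one. A minimal sketch, assuming log lines shaped like the ones above (the function name `summarize_agent_calls` is made up):

```shell
# Hypothetical summary: reads syslog lines on stdin and counts how often each
# PassengerAgent subcommand (the word right after the binary path) appears.
summarize_agent_calls() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i ~ /\/PassengerAgent$/) { count[$(i + 1)]++; break }
        }
    }
    END { for (s in count) print count[s], s }' | sort -k2,2
}
```

Feeding it the four lines above should show one `core`, two `spawn-env-setupper` and one `watchdog` call. On a host that logs to /var/log/messages (an assumption about the syslog setup), `grep 'OOD:' /var/log/messages | summarize_agent_calls` would do it.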
Hopefully something in there is a clue. I’m going to do some more testing on the problem server this weekend, starting with a clean install of 1.6. Then I’ll do a similar random walk, trying the update disable and the AMD-compiled rpm. If none of that works, I’ll try to capture how PassengerAgent is called, which may at least narrow the problem down to one of those invocations.
~griznog