Problem with new installation

I have a new install of OOD v4.0.8 which I am installing from scratch on a RHEL9.6 – which I am performing strictly following the instructions at Installation — Open OnDemand 4.0.0 documentation (besides authentication, which is done with LDAP/Dex in the same way I did for an older version which works just fine).

The authentication seems to be working (wrong user/passwd are rejected, good ones are let in), however upon successful login I get

Looking at /var/log/httpd/<URL redacted>_error_ssl.log I see the following, which does have a few strange things

[Tue Nov 25 17:07:42.553615 2025] [auth_openidc:warn] [pid 1123068:tid 1123244] [client 10.1.1.254:59584] oidc_check_x_forwarded_hdr: header X-Forwarded-Proto received but OIDCXForwardedHeaders not configured for it, referer: https://<URL redacted>:8443/dex/auth/ldap/login?back=&state=guac4ceuxybfmwhu4coapxr7t
[Tue Nov 25 17:07:42.553866 2025] [proxy:error] [pid 1123068:tid 1123244] (2)No such file or directory: AH02454: http: attempt to connect to Unix domain socket /var/run/ondemand-nginx/davide/passenger.sock (localhost:80) failed
[Tue Nov 25 17:07:42.553885 2025] [proxy_http:error] [pid 1123068:tid 1123244] [client 10.1.1.254:59584] AH01114: HTTP: failed to make connection to backend: localhost, referer: https://<URL redacted>:8443/dex/auth/ldap/login?back=&state=guac4ceuxybfmwhu4coapxr7t
[Tue Nov 25 17:07:42.554821 2025] [lua:info] [pid 1123068:tid 1123244] [client 10.1.1.254:59584] allowed_hosts="<URL redacted>" req_is_https="true" req_status="503" req_hostname="<URL redacted>" res_content_location="" local_user="davide" req_port="8443" req_referer="<URL redacted>:8443/dex/auth/ldap/login" req_method="GET" time_user_map="1.997" res_content_disp="" remote_user="Davide" req_origin="" res_content_type="text/html; charset=iso-8859-1" log_time="2025-11-25T23:07:42.554730.0Z" req_accept_language="en-us,en;q=0.5" req_content_type="" req_filename="proxy:http://localhost/pun/sys/dashboard" req_user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" time_proxy="1.182" res_location="" req_uri="/pun/sys/dashboard" res_content_encoding="" res_content_length="299" req_server_name="<URL redacted>" res_content_language="" req_cache_control="" log_hook="ood" req_is_websocket="false" log_id="aSY2vkzTsf3yG2FKqZmhpgAAAJc" req_accept="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" req_handler="proxy:unix:/var/run/ondemand-nginx/davide/passenger.sock|http://localhost" req_accept_charset="" req_accept_encoding="gzip, deflate, br, zstd" req_protocol="HTTP/1.1" req_user_ip="10.1.1.254", referer: <URL redacted>:8443/dex/auth/ldap/login?back=&state=guac4ceuxybfmwhu4coapxr7t

namely:

  • it uses localhost rather than the actual URL in a few places
  • has my username with an uppercase (but I suspect that is an ActiveDirectory-ism and I have seen it in other contexts too where it did not cause issues there)
  • tries to open a socket against port 80 which fails (if I point my browser to that URL that gets redirected to port 8443 and is not found either)
  • /var/run/ondemand-nginx/ exists but does not have any subdirectory under it

I suspect I missed a step in the instructions I followed, because looking ahead in reference and tutorial sections they seem to imply that the /etc/ood/config/ondemand.d/ and /etc/ood/config/apps/ exist (for example at PUN environment — Open OnDemand 4.0.0 documentation and ondemand.d/*.yml files — Open OnDemand 4.0.0 documentation ) whereas those paths do not exist in my install:

On the other hand /var/www/ood/apps/ does exist and appears to be populated by the usual stuff

So I am not sure what may be wrong here?

So nobody can help me with this? I have not been able to make any progress. I tried several things, including:

  • upgraded Apache (which released a new version since I initially posted about this issue and a RHEL post suggested it could been the culprit)
  • various permissions, such as POSIX writing permissions in that directory, private settings of tmp directories in Apache and more
  • looking at all the logs I could find
  • running the following (which I saw in /var/log/secure is happening at that time) completes without message and exit with code 0
[root ~] # tail -3 /var/log/secure
Jan  8 21:41:11 hostname sudo[3727779]:  apache : PWD=/ ; USER=root ; COMMAND=/opt/ood/nginx_stage/sbin/nginx_stage pun -u davide -a https%3a%2f%2fURL_REDACTED%3a8443%2fnginx%2finit%3fredir%3d%24http_x_forwarded_escaped_uri
Jan  8 21:41:11 hostname sudo[3727779]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=48)
Jan  8 21:41:11 hostname sudo[3727779]: pam_unix(sudo:session): session closed for user root

[root ~] # /opt/ood/nginx_stage/sbin/nginx_stage pun -u davide -a https%3a%2f%2fURL_REDACTED%3a8443%2fnginx%2finit%3fredir%3d%24http_x_forwarded_escaped_uri
[root ~] # echo $?
0
[root ~] # 

None of these change a thing, I always get that Service Unavailable message in the browser and that extremely annoying No such file or directory error in the logs, and most frustratingly no clue why!

Reading about similar problem online such as this or this other suggests that some service may be not installed or not configured correctly or something. However I cannot find anything relevant missing (those reports are for PHP which obviously is not applicable here), unless there is something missing in the OnDemand documentation.

For example what should create that /var/run/ondemand-nginx/davide/passenger.sock, and when? Not the php-fpm service obviously. I can’t find any information about that. Does anybody here know? Is there a way that I could crank the verbosity of the logs to the maximum to get any clue why that socket is not created?

Sorry for the long delay. Booting the PUN should create that domain socket you’ve listed there. If the file isn’t there, then I’d guess the PUN doesn’t boot.

I’d check if you have selinux running and any messages in /var/log/audit that prohibit nginx from starting up. Nginx should be creating this file to listen on.

Thanks Jeff, this is a big clue!

SELinux has always been off on this machine (and the first thing I double checked).

In the audit files there are some messages which I am not sure I understand, but they are all success so maybe not concerning?

type=USER_END msg=audit(1767974382.737:19726): pid=3775812 uid=48 auid=4294967295 ses=4294967295 subj=kernel msg='op=PAM:session_close grantors=pam_keyinit,pam_limits,pam_systemd,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'UID="apache" AUID="unset"
type=CRED_DISP msg=audit(1767974382.737:19727): pid=3775812 uid=48 auid=4294967295 ses=4294967295 subj=kernel msg='op=PAM:setcred grantors=pam_env,pam_localuser,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'UID="apache" AUID="unset"
type=USER_ACCT msg=audit(1767974382.746:19728): pid=3775819 uid=48 auid=4294967295 ses=4294967295 subj=kernel msg='op=PAM:accounting grantors=pam_unix,pam_localuser acct="apache" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'UID="apache" AUID="unset"
type=USER_CMD msg=audit(1767974382.746:19729): pid=3775819 uid=48 auid=4294967295 ses=4294967295 subj=kernel msg='cwd="/" cmd=2F6F70742F6F6F642F6E67696E785F73746167652F7362696E2F6E67696E785F73746167652070756E202D752068353035343935202D612068747470732533612532662532666C65746F2D6F642E686F6E657977656C6C2E6C6162253361383434332532666E67696E78253266696E69742533667265646972253364253234687474705F785F666F727761726465645F657363617065645F757269 exe="/usr/bin/sudo" terminal=? res=success'UID="apache" AUID="unset"
type=CRED_REFR msg=audit(1767974382.747:19730): pid=3775819 uid=48 auid=4294967295 ses=4294967295 subj=kernel msg='op=PAM:setcred grantors=pam_env,pam_localuser,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'UID="apache" AUID="unset"
type=USER_START msg=audit(1767974382.763:19731): pid=3775819 uid=48 auid=4294967295 ses=4294967295 subj=kernel msg='op=PAM:session_open grantors=pam_keyinit,pam_limits,pam_systemd,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=? res=success'UID="apache" AUID="unset"

Instead, I can tell that I did not edit anything in /etc/ood/config/nginx_stage.yml which is basically all commented out. Could that be the root cause of the problem? If not, how do I check if PUN did boot, and how do I attempt to force it booting, so that I can see whatever problem it encountered?

Thanks!

If you issue the command to start the PUN manually - or try through apache, you can check to see if it’s running through ps searches.

Beyond that - I would check /var/log/ondemand-nginx/$USER/error.log and /var/log/ondemand-nginx/error.log for any error output.

So that’s clearly the problem then:

[root@node30 ~]# ps aux | grep pun
root     3781997  0.0  0.0   3876  1792 pts/0    S+   10:28   0:00 grep --color=auto pun
[root@node30 ~]# ps aux | grep -i pun
root     3782015  0.0  0.0   3876  1792 pts/0    S+   10:28   0:00 grep --color=auto -i pun
[root@node30 ~]# ps aux | grep -i nginx
root     3782019  0.0  0.0   3876  1792 pts/0    S+   10:29   0:00 grep --color=auto -i nginx
[root@node30 ~]# ll /var/log/ondemand-nginx/
total 0

I’m totally lost here: the last time I installed OnDemand was v1.8, IIRC, and many things have changed. So I am following the instructions at Installation — Open OnDemand 4.0.0 documentation and I think I followed them diligently.

I have the services for httpd and ondemand-dex started and they are running. Their logs report:

  • running pun as I wrote above (with that nginx_stage command) and
  • correctly find the users in LDAP (in fact I can login with the correct AD username and password but not connect if I intentionally mispell either or both)

As I wrote before, if I run /opt/ood/nginx_stage/sbin/nginx_stage pun -u davide -a https%3a%2f%2fURL_REDACTED%3a8443%2fnginx%2finit%3fredir%3d%24http_x_forwarded_escaped_uri it does absolutely nothing and exit immediately with exit code 0 (and leaves nothing running behind)

What am I missing?

Thanks again, Jeff

OK, having nothing in /var/log/ondemand-nginx is very strange indeed.

Now I guess we have to go lower in dmesg or journalctl to see if there’s anything that the Linux kernel itself would be reporting on. I’m thinking maybe something like out of memory errors or similar. Like the process boots up correctly (becuase the exit status is 0) but then is quickly terminated by the kernel for some reason.

This is a beefy server and I’d be shocked if it were OOM’ing while basically doing nothing, but to double check…

[root@node30 ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:           503Gi       7.0Gi       384Gi       284Mi       115Gi       496Gi
Swap:          9.3Gi          0B       9.3Gi

No traces of OOMs anywhere in the kernel logs, and nothing timestamped as today in dmesg

Very light logging in messages which seems irrelevant to me.

[root@node30 ~]# tail -30 /var/log/messages
Jan  9 09:57:47 node30 systemd[1]: session-c105.scope: Deactivated successfully.
Jan  9 10:00:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 10:00:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 10:00:02 node30 systemd[1]: Finished system activity accounting tool.
Jan  9 10:10:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 10:10:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 10:10:02 node30 systemd[1]: Finished system activity accounting tool.
Jan  9 10:20:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 10:20:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 10:20:02 node30 systemd[1]: Finished system activity accounting tool.
Jan  9 10:30:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 10:30:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 10:30:02 node30 systemd[1]: Finished system activity accounting tool.
Jan  9 10:35:54 node30 systemd[1]: Starting dnf makecache...
Jan  9 10:35:54 node30 dnf[3782497]: Updating Subscription Management repositories.
Jan  9 10:35:55 node30 dnf[3782497]: Metadata cache refreshed recently.
Jan  9 10:35:55 node30 systemd[1]: dnf-makecache.service: Deactivated successfully.
Jan  9 10:35:55 node30 systemd[1]: Finished dnf makecache.
Jan  9 10:40:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 10:40:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 10:40:02 node30 systemd[1]: Finished system activity accounting tool.
Jan  9 10:50:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 10:50:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 10:50:02 node30 systemd[1]: Finished system activity accounting tool.
Jan  9 11:00:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 11:00:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 11:00:02 node30 systemd[1]: Finished system activity accounting tool.
Jan  9 11:10:02 node30 systemd[1]: Starting system activity accounting tool...
Jan  9 11:10:02 node30 systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan  9 11:10:02 node30 systemd[1]: Finished system activity accounting tool.

On the other hand journalctl does provide some info. Most of that is regular things such as ondemand-dex[3726393]: time=2026-01-09T08:59:23.120-07:00 level=INFO msg="performing ldap search some other is the same pun command I saw and pasted earlier (twice already, so I’m not copy-pasting that part again) with that PAM session opened and closed not even one second apart.

The only new information that I see is this one

Jan 09 11:00:01 node30.cluster CROND[3784168]: (root) CMD ([ -f /opt/ood/nginx_stage/sbin/nginx_stage ] && /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean 2>&1 | logger -t nginx_clean)
Jan 09 11:00:01 node30.cluster CROND[3784167]: (root) CMDEND ([ -f /opt/ood/nginx_stage/sbin/nginx_stage ] && /opt/ood/nginx_stage/sbin/nginx_stage nginx_clean 2>&1 | logger -t nginx_clean)

which appears to be repeated every couple of hours.

Now since also this seems to be a dead end, I opened that nginx_stage file and here it is!!

#!/usr/bin/env bash

# The purpose of this script is to wrap up the necessary environment for the
# per-user NGINX (PUN) processes to run under. The PUN requires the ondemand
# Software Collection.

# Root directory for this library
#
ROOT_DIR="$(dirname "$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")")"

# Source in the default environment
# shellcheck source=/etc/profile
source "${ROOT_DIR}/etc/profile"

# Allow admin to override the environment the PUN runs in
#

NGINX_PROFILE=${NGINX_PROFILE:-/etc/ood/profile}
if [[ -f "${NGINX_PROFILE}" ]]; then
  # Source in the custom environment
  # shellcheck source=nginx_stage/etc/profile
  source "${NGINX_PROFILE}"
fi

# Environment is set, so call Ruby now
#
#exec \
#  /usr/bin/env ruby \
#    -I"${ROOT_DIR}/lib" \
#    -rnginx_stage \
#    -e "NginxStage::Application.start" \
#    -- "${@}"

This appears to be to be doing nothing? Is this how it is supposed to be? Why is the invocation to Ruby commented out? I did not commented out myself, I simply installed ondemand from the repos…

Thanks!

:laughing: that would do it. Does sudo rpm -V ondemand show it as having edits after install?

1 Like

Thanks I wonder how that happened, but you are definitely correct:

[root@co49svnode30 ~]# rpm -V ondemand | grep 5
S.5....T.  c /etc/httpd/conf.d/ood-portal.conf
..5....T.  c /etc/ood/config/ood_portal.sha256sum
S.5....T.  c /etc/ood/config/ood_portal.yml
S.5....T.    /opt/ood/nginx_stage/sbin/nginx_stage

After I uncommented the ruby invocation the MD5 matched and OOD now gets started, Huzzah!

Thanks so much for your time and patience!
Have a great weekend

Glad to hear you found the issue! Would not have thought to look for something like that!

1 Like

Scratch that last message if you saw it - I thought it was the systemd unit file, not the httpd conf file.