Apache's sudo call to start PUN times out

Hello from rainy Oslo, Norway! First of all, thank you for this amazing project! I’ve been setting up OnDemand for our local cluster and it’s been going well - but I’ve hit one problem that I’m having a hard time figuring out.

When a user logs in “cold” (no PUN running), there is always a ~30 second delay.

Long story short, it’s caused by something systemd-related timing out trying to sudo:

Feb 17 09:10:19 ood-dev01.educloud.no sudo[417843]:   apache : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/opt/ood/nginx_stage/sbin/nginx_stage pun -u ec-buzh -a https%3a%2f%2food.educloud.no%3a443%2fnginx%2finit
Feb 17 09:10:20 ood-dev01.educloud.no systemd[1]: Started Session c1 of user root.
Feb 17 09:10:45 ood-dev01.educloud.no sudo[417843]: pam_systemd(sudo:session): Failed to create session: Connection timed out
Feb 17 09:10:45 ood-dev01.educloud.no sudo[417843]: pam_unix(sudo:session): session opened for user root by (uid=0)
Feb 17 09:10:45 ood-dev01.educloud.no sudo[417843]: pam_unix(sudo:session): session closed for user root
Feb 17 09:10:46 ood-dev01.educloud.no httpd[65310]: oida + exec
Feb 17 09:10:46 ood-dev01.educloud.no httpd[65310]: + exec

As you can see, there is a 25 second delay from “Started Session” until the pam_systemd(sudo:session) timeout. Once the PUN is running, everything is fine - it’s just that sudo call that delays things.

The “oida + exec” is there because I added some debug output to pun_proxy.lua trying to pin down where it happened:

    err = nginx_stage.pun(r, pun_stage_cmd, user, app_init_url, pun_pre_hook_exports, pun_pre_hook_root_cmd)
    if err then
      r.usleep(1000000) -- sleep for 1 second before trying again
      print("oida" .. " " .. err)
    end

I was initially thinking it happened in the lua script, but it turns out it must be something in systemd-logind or something related. It looks very similar to this old systemd issue:

However, it’s not exactly the same. Also it seems nobody else on the internet ever had that happen with sudo:session (or at least that’s what Google says).

I tried adjusting dbus so that the timeouts that were around 25 seconds were shorter, hoping that it would help me further isolate the problem, but without luck - the sudo is still delayed by 25s every time.

Any ideas about what this could be?

I’m running RHEL 8.7 in vmware with OnDemand 2.0.29-1.

Best,
Andreas at the University of Oslo dept of Scientific Computing Services.

Hi and welcome!

Thank you for the detailed description.

I’m seeing the same results from google. That said, since it is a sudo:session I’m wondering if you can replicate with any of these commands?

sudo ls
sudo -u root ls

# or switch into root and issue sudo
sudo su - root 
sudo ls

It would seem the latter set of commands would really replicate. That is, the former are

pam_unix(sudo:session): session opened for user root by johrstrom(uid=0)

I.e., We need a sudo session opened for user root by user root, not the unprivileged user.

If that doesn’t replicate directly, here’s a suggestion on adding debugging in the OnDemand stack. Note that you should reset this all after you’re done. The idea here is to build some sort of wrapper script and execute that instead of nginx_stage directly.

Here’s the setting you need to modify.

pun_stage_cmd: /some/wrapper/that/can/strace

With a strace script that will strace sudo.

#!/bin/bash
# contents of  /some/wrapper/that/can/strace

strace -o /tmp/sudo_strace.out sudo /opt/ood/nginx_stage/sbin/nginx_stage $@

Note that you’ll have to modify your sudo rules to allow for this. Be sure to reset after you’ve completed debugging. Indeed you seem to be in the cloud, so I’d say re-roll the entire machine if you can.

https://osc.github.io/ood-documentation/latest/reference/files/ood-portal-yml.html

Thanks for the suggestions, Jeff.

Strace logs didn’t help me much, but I eventually figured out what was going on: selinux!

The denials were filtered out by the default settings on EL8, but after doing semodule -DB it turned out that httpd_t was denied the net_admin capability. I just confirmed that allow httpd_t self:capability net_admin removes the delay I was seeing.

I’ll be setting up another OOD instance from scratch in a week or two, and I’ll check to see if the same problems happens, and if I can I will update this issue so others can benefit. For now I basically did:

semodule -DB
# log into a fresh session here
audit2allow -a -M ood_sudo
semodule -i ood_sudo.pp
semodule -B

:man_facepalming: of course it’s something simple. Glad to hear you got it figured out.

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.