Hello from rainy Oslo, Norway! First of all, thank you for this amazing project! I’ve been setting up OnDemand for our local cluster and it’s been going well - but I’ve hit one problem that I’m having a hard time figuring out.
When a user logs in “cold” (no PUN running), there is always a ~30 second delay.
Long story short, it’s caused by something systemd-related timing out when Apache calls sudo:
Feb 17 09:10:19 ood-dev01.educloud.no sudo[417843]: apache : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/opt/ood/nginx_stage/sbin/nginx_stage pun -u ec-buzh -a https%3a%2f%2food.educloud.no%3a443%2fnginx%2finit
Feb 17 09:10:20 ood-dev01.educloud.no systemd[1]: Started Session c1 of user root.
Feb 17 09:10:45 ood-dev01.educloud.no sudo[417843]: pam_systemd(sudo:session): Failed to create session: Connection timed out
Feb 17 09:10:45 ood-dev01.educloud.no sudo[417843]: pam_unix(sudo:session): session opened for user root by (uid=0)
Feb 17 09:10:45 ood-dev01.educloud.no sudo[417843]: pam_unix(sudo:session): session closed for user root
Feb 17 09:10:46 ood-dev01.educloud.no httpd[65310]: oida + exec
Feb 17 09:10:46 ood-dev01.educloud.no httpd[65310]: + exec
As you can see, there is a 25 second delay from “Started Session” until the pam_systemd(sudo:session) timeout. Once the PUN is running, everything is fine - it’s just that sudo call that delays things.
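For anyone who wants to check the same timing in a longer journal dump, here is a minimal Python sketch that finds the largest gap between consecutive syslog lines. The log lines are taken from the excerpt above, with the first message shortened. The helper names (`parse_timestamp`, `largest_gap`) and the year are assumptions made for illustration, since syslog timestamps carry no year.

```python
from datetime import datetime

# Syslog lines from the excerpt above; the first message is shortened for readability.
LOG_LINES = [
    "Feb 17 09:10:19 ood-dev01.educloud.no sudo[417843]: apache : TTY=unknown ; PWD=/ ; USER=root",
    "Feb 17 09:10:20 ood-dev01.educloud.no systemd[1]: Started Session c1 of user root.",
    "Feb 17 09:10:45 ood-dev01.educloud.no sudo[417843]: pam_systemd(sudo:session): Failed to create session: Connection timed out",
    "Feb 17 09:10:46 ood-dev01.educloud.no httpd[65310]: + exec",
]


def parse_timestamp(line, year=2023):
    """Parse the leading 'Mon DD HH:MM:SS' stamp. The year is an assumption."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the biggest gap between consecutive lines."""
    times = [parse_timestamp(line) for line in lines]
    gaps = [
        ((later - earlier).total_seconds(), before, after)
        for earlier, later, before, after in zip(times, times[1:], lines, lines[1:])
    ]
    return max(gaps, key=lambda gap: gap[0])


seconds, before, after = largest_gap(LOG_LINES)
print(f"{seconds:.0f}s stall between:\n  {before}\n  {after}")
```

On these four lines the largest gap is the one between “Started Session” and the pam_systemd timeout, which matches the delay described above.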
The “oida + exec” is there because I added some debug output to pun_proxy.lua trying to pin down where it happened:
err = nginx_stage.pun(r, pun_stage_cmd, user, app_init_url, pun_pre_hook_exports, pun_pre_hook_root_cmd)
if err then
  r.usleep(1000000) -- sleep for 1 second before trying again
  print("oida" .. " " .. err)
end
I initially thought the delay happened in the Lua script, but it turns out it must be in systemd-logind or something closely related. It looks very similar to this old systemd issue:
However, it’s not exactly the same. Also, it seems nobody else on the internet has ever had this happen with sudo:session (or at least that’s what Google says).
I tried shortening the D-Bus timeouts that were around 25 seconds, hoping it would help me isolate the problem further, but without luck: the sudo call is still delayed by 25 seconds every time.
Any ideas about what this could be?
I’m running RHEL 8.7 in VMware with OnDemand 2.0.29-1.
Best,
Andreas at the University of Oslo dept of Scientific Computing Services.