Inconsistent environment variables in fork of bc_osc_codeserver

We have an Open OnDemand setup running on AWS ParallelCluster with Slurm as our scheduler. In that configuration, we have a fork of the bc_osc_codeserver app configured to work with our setup and software. Early in the term, users noticed that their files were sometimes missing when using the app. We determined that this was because the app wasn't getting the $HOME environment variable some of the time, and so was using an incorrect home directory rather than the one on the shared EFS volume mounted to all of the nodes in our cluster. We've configured the app to set $HOME appropriately for our setup if it isn't already set (roughly the guard sketched below), so that symptom is handled, but I haven't been able to find a root cause.

I've added some lines to the codeserver script to log the initial home directory and the one set by the start script, and it's still inconsistent: sometimes the HOME variable is present, sometimes it's not. I haven't been able to find a pattern in when it's set or not (it isn't limited to specific users or time periods). All of the sbatch commands that launch interactive apps use --export=NONE, so my thought is that when Slurm "implicitly attempt[s] to load the user's environment on the node where the script is being executed", that attempt is sometimes failing. However, I haven't had any luck turning up more information about what's happening when the environment fails to load.

I'm considering changing the default behavior to use --export=NIL instead, so at least the behavior will be consistent and we'll have a steady baseline as we expand our offering of interactive apps. Any insight into what could be happening would be much appreciated.
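For context, the workaround in our fork is roughly the following guard near the top of the app's launch script. This is just a sketch; the EFS path here is a placeholder, not our actual mount point:

# Fall back to the shared EFS home directory if Slurm didn't populate $HOME
if [ -z "${HOME:-}" ]; then
  export HOME="/efs/home/$(whoami)"
  echo "WARNING: \$HOME was unset; falling back to ${HOME}" >&2
fi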

OK - I did a lot of digging on this, and the TL;DR is that /etc/passwd or LDAP is always responsible for this value; Slurm's daemon should then set the environment variable from that lookup.

Login programs like sshd (or slurmd, in this case) are responsible for setting the environment variable, but the actual value always comes from /etc/passwd or an external LDAP directory. So I would look there.

Here's a Ruby one-liner you can use to test (because I don't know the bash equivalent if you use LDAP). Submit this file with --export=NONE to try to replicate the issue; it goes through essentially the same libraries (Etc and getpwnam) that Slurm is using.

#!/bin/bash

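# Look up the submitting user's passwd entry (home directory included) via getpwnam, the same lookup slurmd relies on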
ruby -e "require 'etc'; puts Etc.getpwnam(Etc.getlogin)"
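Assuming it's saved as something like getpw_test.sh (filename only for illustration), submitting it the way the interactive apps do should reproduce the conditions:

sbatch --export=NONE --output=getpw_test.out getpw_test.sh

Then check getpw_test.out on runs where $HOME goes missing to see whether the passwd lookup itself came back empty or slow.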

So, given you're in AWS, I would wonder what your /etc/passwd looks like if you have local users, or whether the LDAP system is somehow misfiring.

Thanks for taking the time to look at this! We are indeed using LDAP. I'll test out this script, investigate our LDAP setup, and look for relevant logs. We have SSSD in the mix too, and there's a fallback home directory that we have to set there as well (sketched below), so that's another potential point of failure to look at. I'll follow up on these leads and post an update when I know more. Thanks again!
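For reference, the SSSD setting I mean is fallback_homedir in sssd.conf; something roughly like this, with a placeholder domain name and path rather than our real values:

# /etc/sssd/sssd.conf (domain section), illustrative values only
[domain/example.com]
# used when the LDAP entry doesn't supply a usable home directory
fallback_homedir = /efs/home/%u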

Update after some investigation, posting so future folks don't find this thread to be a dead end. It turned out to be LDAP on our end. I poked around the SSSD logs on the compute nodes and found that when the environment variables weren't getting populated, SSSD considered the LDAP server to be offline. It was marking it offline after requests to the LDAP server timed out, because uid lookups were taking more than 6 seconds (SSSD's default search timeout). We're now working with the IAM group that manages that LDAP server to make sure it's indexed appropriately for SSSD, since those queries shouldn't be timing out.

The evidence I needed came from looking at the SSSD logs on the compute node that ran a job with a missing environment variable, and finding the query for that user that hit SSSD at that time.
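For anyone retracing this, the checks amounted to roughly the following on the affected compute node (the log file name and username are placeholders for your sssd.conf domain and the affected user):

# Look for SSSD deciding the LDAP backend is offline around the job's start time
sudo grep -iE 'offline|timed out' /var/log/sssd/sssd_example.com.log

# Time a lookup directly; SSSD's default LDAP search timeout is 6 seconds,
# so lookups in that neighborhood are the ones that push the backend offline
time getent passwd someuser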
