Interactive jobs started inside Interactive Desktop don't export environment variables (Slurm)

We’re using Open OnDemand 1.6.22 with Slurm 19.05.4.

A common use case for our users is to use long-running Interactive Desktop sessions to launch shorter-running interactive jobs which need their own resources. This is done by using salloc to request a new allocation, and then using srun to launch the interactive jobs with this allocation.

I’m seeing an unexpected behavior, which is that by default it appears no environment variables are exported to interactive (srun) jobs when run inside of the OOD Interactive Desktop app. In order to avoid a job erroring out, I have to explicitly export all environment variables. This is not the behavior I see with interactive jobs, or nested interactive jobs, outside OOD.

Example:

nc060:~ % salloc -c4
salloc: Granted job allocation 16924909
nc060:~ % srun --pty bash
slurmstepd: error: execve(): bash: No such file or directory

Here is the workaround:

nc060:~ % salloc -c4
salloc: Granted job allocation 16924910
nc060:~ % srun -c4 --export=ALL --pty bash
nc265:~ %

Based on my read of the srun man page, the default Slurm behavior should always export all environment variables to an interactive job.

Can anyone shed light on why this behavior is not seen in Open OnDemand? Thank you.

Yes, in 1.6 this functionality doesn’t exist (we always pass the environment variable SBATCH_EXPORT=NONE).

In 1.7 it does not pass this by default (it passes nothing, which in turn uses the Slurm default), and that actually breaks some desired behaviors, so it’s slightly broken and I hope to patch it in the coming days.

If you upgrade to 1.7 and you need to export additional entries, you’ll need to add these to the submit.yml.erb.

script:
  copy_environment: true
  job_environment:
    FOO: BAR

This would translate to --export=ALL,FOO.

@jeff.ohrstrom Am I understanding correctly that the logic behind not using SLURM_EXPORT=ALL as the default for OOD is that, if OOD did so, it would export OOD’s environment, not the user’s? That’s from reading the comments in
https://github.com/OSC/ood_core/pull/193 lines 632-636

We’ve been trying to figure out why we couldn’t get OpenMPI to run from within a Remote Desktop application, and it was the setting of SLURM_EXPORT=NONE that was the problem.

I am still not quite clear on how it could be a problem when submitting a job generally but not a problem when it is set in submit.yml.erb. Would you mind elucidating for me?

It might also be worth adding a note about this to the README or other documentation for the Remote Desktop.

OOD’s environment is specific to the current user, but it is limited. As an example, it doesn’t have the module function. OOD doesn’t load /etc/profile (and subsequently anything from /etc/profile.d/ or your own ~/.bashrc). So it is limited in that regard. And yes, Slurm would export it which is rather specific to operating OOD. There are ways in OOD to fix this, by adding a script_wrapper or header to the submit.yml.erb and force loading /etc/profile there.

So really it comes down to shell environments. When you export NONE, Slurm does everything needed to set up a new environment. When you use ALL, no new environment is set up, and since OOD never set up a shell environment, you’d miss out on all of this (specifically, the exported function module is what’s commonly required).
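
For anyone who wants to try the "force loading /etc/profile" idea, here is a minimal sketch of what the shell side of a script_wrapper or header could look like. The function name ensure_login_env and its optional path argument are made up for illustration; they are not part of OOD or Slurm.

```shell
# Hypothetical helper: load a login-style profile only if `module` is missing.
# The profile path is a parameter so the sketch can be tried against a test
# file; it defaults to /etc/profile.
ensure_login_env() {
  profile="${1:-/etc/profile}"

  # `command -v` succeeds for functions, builtins, and executables on PATH.
  if command -v module >/dev/null 2>&1; then
    return 0
  fi

  # Source the profile in the current shell so its functions and exports persist.
  if [ -r "$profile" ]; then
    . "$profile"
  fi

  # Return success only if `module` is available now.
  command -v module >/dev/null 2>&1
}

# Example use inside a job script (the module name is a placeholder):
# ensure_login_env && module load some_module
```

Whether this is enough depends on how your site defines module; some sites define it in /etc/profile.d/ scripts that expect a login shell.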

Hope that helps!

At least on our systems, this seems to cause issues as well. I have not figured out if it is a race condition in SLURM or in our env setup script. Basically, I can do an sbatch --export=NONE script.sh where the script simply spits out env on a login node logged in as a user. If I wrap that in a for loop to launch a bunch of jobs, in 5% of the cases the env is limited to SLURM env variables only, i.e. the environment is only partially set up. What is super odd is that it seems to be user specific and, within users, also Slurm account specific.

In our case, for another reason (slurm version compatibility between supported clusters), I have converted to bin_overrides for all slurm commands such that the commands are now ssh clusterlogin slurm_command. What I think this means is that we should have the environment from the ssh to login as user.

We recently had to fix this for our Slurm cluster too at OSC.

You can see here how we did it. We just set export SLURM_EXPORT_ENV=ALL in the batch_connect.before_script portion of our submit yml.
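
If you can't change the app config, the same override can be tried by hand in a terminal inside the desktop session. This is only a sketch: it assumes the session inherited SLURM_EXPORT_ENV=NONE from the way the desktop job was submitted, and show_slurm_export is a hypothetical helper name, not a Slurm command.

```shell
# Print the current value of SLURM_EXPORT_ENV, which srun treats like --export.
show_slurm_export() {
  if [ -n "${SLURM_EXPORT_ENV+set}" ]; then
    printf 'SLURM_EXPORT_ENV=%s\n' "$SLURM_EXPORT_ENV"
  else
    printf 'SLURM_EXPORT_ENV is unset\n'
  fi
}

show_slurm_export            # inspect what the desktop job inherited
export SLURM_EXPORT_ENV=ALL  # same override as the before_script fix
show_slurm_export            # now reports SLURM_EXPORT_ENV=ALL
# srun --pty bash            # later job steps should then inherit the environment
```

Setting it in before_script is the more durable choice, since every terminal opened in the session then starts with the override already in place.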

Thanks Jeff. If I want this set globally, would it go in the cluster.yml?

@rsettlag yes the docs for that are here:

https://osc.github.io/ood-documentation/latest/reference/files/submit-yml-erb.html#setting-batch-connect-options-globally