SLURM job environment

I noticed that in OnDemand interactive jobs, subsequent “srun” commands start with an empty environment.

That is, if I start the Interactive Desktop and open a terminal in that desktop, I have the full environment, but if I then launch an executable with an srun command (e.g. to run MPI), I get:

$ srun -n 1 hostname
slurmstepd: error: Unable to create TMPDIR [/scratch/local//2666718]: Permission denied
slurmstepd: error: Setting TMPDIR to /tmp
slurmstepd: error: execve(): hostname: No such file or directory

which is equivalent to “srun --export=NONE”.
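
For comparison, passing --export=NONE explicitly from a shell inside any SLURM job should reproduce the same failure, since PATH is not exported to the step (a sketch; the exact slurmstepd messages will vary by site):

$ srun --export=NONE -n 1 hostname
slurmstepd: error: execve(): hostname: No such file or directory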

Digging into OOD’s SLURM adapter code at ood_core-0.13.0/lib/ood_core/job/adapters/slurm.rb, I found this section:

          # We default to export NONE, but SLURM defaults to ALL.
          # We do this because SLURM sets up a new environment (loading
          # /etc/profile and providing the 'module' function, among other
          # things shells give), whereas the PUN did not.
          # --export=ALL exports the PUN's environment.
          def export_arg(env, copy_environment)
            if !env.empty? && !copy_environment
              env.keys.join(",")
            elsif !env.empty? && copy_environment
              "ALL," + env.keys.join(",")
            elsif env.empty? && copy_environment
              # only this option changes behavior dramatically
              "ALL"
            else
              "NONE"
            end
          end
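
For reference, here is how the four branches map out when the method is called in isolation (my own illustration, not part of the adapter):

          export_arg({}, false)                #=> "NONE"
          export_arg({}, true)                 #=> "ALL"
          export_arg({ "FOO" => "1" }, false)  #=> "FOO"
          export_arg({ "FOO" => "1" }, true)   #=> "ALL,FOO"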

I am wondering whether this is responsible for setting “--export=NONE” for the srun commands inside the OOD job, or whether it only applies to the “sbatch” that starts the interactive job.

Or, does anyone have other thoughts in this regard?

Thanks,
MC

Additional info: “srun” inside the OOD interactive job works correctly if I comment out

          args.concat ["--export", export_arg(env, script.copy_environment?)]

in the slurm.rb file. Does the --export=NONE come as a default from this file? I am not good at Ruby debugging and haven’t figured out how to see the actual sbatch line, so I don’t know for sure.

Thanks,
MC

One more update: I think I understand why OOD does not want to export the whole submission shell environment; that is the web server’s environment, with all the Ruby paths, etc. So you do need some form of “sbatch --export” to strip that out.

Doing “srun --export=ALL” inside the OOD interactive app is a reasonable workaround, though it makes me wonder whether SLURM’s behavior has changed, as we did not have this issue in the past, when I was setting up an interactive app (Ansys Workbench) that uses “srun” to find the hosts to run on. From what I can tell, the issue is present on SLURM 19.05, which we ran last year, and on 20.11, which we run now. I was likely setting up the app on 17.11.
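
In other words, inside the job’s terminal (node001 stands in for whatever host the step lands on):

$ srun --export=ALL -n 1 hostname
node001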

I am wondering if someone out there on an older version of SLURM could run this simple test and report what they get: start the Interactive Desktop app and, in the terminal, run “srun -n 1 hostname”. If the “hostname” command is not found, you have the same issue I am describing; if it is found, I’d be curious to know the SLURM version.

Thanks,
MC

A thousand apologies for not circling back to this sooner. Yes and no. Because we set --export=NONE when we submit the job, srun inherits that setting through an environment variable in the job, as you may have found out.
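
You can confirm this from the job’s terminal; with the adapter’s default submission arguments, the variable should read:

$ echo $SLURM_EXPORT_ENV
NONE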

A mitigation for you may be to set the environment variable SLURM_EXPORT_ENV within the job. At that point it’s NONE, so simply exporting a different value (ALL) works.
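
For example, in the job’s terminal:

$ export SLURM_EXPORT_ENV=ALL
$ srun -n 1 hostname
node001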

The idea here is to set NONE on the web host when you submit, but as soon as the job starts, set it to ALL so that when you run srun interactively you don’t have to worry about it and srun uses the default everyone’s used to.

That’s the config we use for bc_desktop specifically, but you can apply the same batch_connect before_script globally (for both vnc and basic jobs) by adding it to your cluster config file; see the sketch below the documentation link.

https://osc.github.io/ood-documentation/latest/reference/files/submit-yml-erb.html#setting-batch-connect-options-globally
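
A minimal sketch of what that could look like, assuming a cluster file at /etc/ood/config/clusters.d/my_cluster.yml (the file name is a placeholder; check the linked docs for the exact keys your version supports):

---
v2:
  batch_connect:
    basic:
      before_script: "export SLURM_EXPORT_ENV=ALL"
    vnc:
      before_script: "export SLURM_EXPORT_ENV=ALL"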