Current state of slurm-adapter env-export for srun?

Motivation:

As a best practice, we encourage our researchers to prefix all resource-intensive commands in their Slurm batch scripts with “srun”. (TL;DR: better resource reporting.) Unfortunately, out of the box, Open OnDemand’s Job Composer “breaks” srun prefixing within an sbatch script.

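To make the pattern concrete, here’s the kind of script we ask researchers to write (a hypothetical job; the program name and resource values are made up):

#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# srun runs the command as a job step, so Slurm accounts for its
# CPU and memory usage per step (visible via sacct/sstat).
srun ./my_simulation input.dat
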
Details:

The Slurm adapter config page alludes to this in a bigger-picture note:

Open OnDemand’s Slurm support defaults to issuing CLI commands with the --export flag set to NONE, when Slurm’s default is ALL. This can cause issues with jobs that require srun.

As I understand the mechanism, sbatch propagates its export setting into the job via the SLURM_EXPORT_ENV variable, so with --export=NONE any srun inside the script also exports nothing, and the job steps lose the script’s environment (loaded modules, PATH changes, and so on). There are two workarounds listed there, and I’ll add two more which I’ve successfully used (sketched just after this list):

  • use a script_wrapper and export SLURM_EXPORT_ENV=ALL before any job script runs
  • in your cluster config, use copy_environment: true (with the caveat that the PUN’s environment is very different from a regular shell session’s)
  • add SLURM_EXPORT_ENV=ALL to the Job Composer env (/etc/ood/config/apps/myjobs/env)
  • add an initializer to the Job Composer app

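Sketches of those last two, as I’ve deployed them (standard OOD app-config paths; the initializer filename is my own choice):

$ cat /etc/ood/config/apps/myjobs/env
# Dotenv file loaded into the Job Composer app's environment.
# My understanding is that SLURM_* variables survive sbatch
# --export=NONE, so this one reaches the job's environment and
# flips srun back to exporting everything.
SLURM_EXPORT_ENV=ALL

$ cat /etc/ood/config/apps/myjobs/initializers/slurm_export.rb
# Same effect via a Rails initializer: set the variable in the app
# process so it's present whenever the adapter shells out to sbatch.
ENV['SLURM_EXPORT_ENV'] = 'ALL'
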
I want to emphasize now that I really, really want to focus only on the Job Composer (aka the “myjobs” app), and not on any other apps. (I know that Batch Connect offers a copy_environment option, but as far as I know, this is not available for the Job Composer.) I mention this because almost all the discussion I can find on this srun issue focuses on batch-connect apps, and my priority is the Job Composer.

Question 1:

Are there any additional approaches, as of OOD v4, for “fixing” the Job Composer? (My apologies for that phrasing, I know it’s an architectural choice. But for my purposes…) I’m asking especially because the third option above is one I only came across in a Discourse post (it wasn’t suggested in any documentation I could find), so maybe there are others?

Question 2:

Does the cluster-config copy_environment option work for the Job Composer? In my testing, I have never gotten it to work.

I’m deliberately not posting this in the Get Help section, but for curiosity’s sake, here’s my config:

$ cat monsoon.yml
# /etc/ood/config/clusters.d/monsoon.yml
---
v2:
  metadata:
    title: "Monsoon"
  login:
    host: "ondemand-dev.hpc.nau.edu"
  job:
    adapter: "slurm"
    cluster: "monsoon"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    copy_environment: true
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/usr/bin/websockify"
        %s
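
For reference, the first workaround from the list above (the docs’ suggestion) would look something like this in my config; a sketch, and note it only covers Batch Connect sessions, not the Job Composer:

batch_connect:
  basic:
    script_wrapper: |
      module purge
      # Re-enable full environment export for srun inside the job script
      export SLURM_EXPORT_ENV=ALL
      %s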

Thank you!

There is a copy_environment option for the Job Composer when you go to edit the job options.

Seems like it should, though I’d have to check on why it doesn’t seem to.

That said, we likely need to change the default; it’s just hard to do when there’s so much inertia after so many years. Maybe it’ll come in 5.0, whenever that may be.

Ah yes, I had success in my testing this morning using that copy-environment checkbox in the job options, which is good to know. That makes it seem even weirder that I had no luck with the cluster’s v2.job copy_environment: true config, neither for apps nor for Job Composer batch submissions, but maybe I’ll post in the other forum about that.

For the last several years, I’ve been using the initializer linked above to “fix” srun in both the dashboard and myjobs apps. In testing today, I’ve also had success with both (dashboard-)apps and Job Composer jobs by adding SLURM_EXPORT_ENV=ALL to both of the /etc/ood/config/apps/{dashboard,myjobs}/env files (see below).
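
In case it helps anyone else, the whole change is one identical line in each file:

# /etc/ood/config/apps/dashboard/env
SLURM_EXPORT_ENV=ALL

# /etc/ood/config/apps/myjobs/env
SLURM_EXPORT_ENV=ALL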

Since the env-file approach seems cleaner, are the two approaches effectively identical, or might there be subtle differences?

I don’t think there’s any difference, no.