Motivation:
As a best practice, we encourage our researchers to prefix all resource-intensive commands in their Slurm batch scripts with “srun”. (TL;DR: better resource reporting.) Unfortunately, out of the box, OnDemand’s Job Composer “breaks” srun prefixing within an sbatch script.
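For concreteness, here is a minimal sketch of the pattern I mean (the job name, module, and program are just placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# Lightweight setup runs as-is...
module load somemodule        # placeholder module

# ...but the heavy lifting is prefixed with srun, so it shows up
# as its own job step in the accounting records.
srun ./my_heavy_program       # placeholder executable
```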
Details:
The Slurm adapter config page alludes to this in its bigger-picture note:

> Open OnDemand’s Slurm support defaults to issuing CLI commands with the `--export` flag set to `NONE`, when Slurm’s default is `ALL`. This can cause issues with jobs that require `srun`.
There are two workarounds listed there, and I’ll add two more which I’ve successfully used (a sketch of the Job Composer env-file option follows this list):

- use a `script_wrapper` and `export SLURM_EXPORT_ENV=ALL` in it before any jobscript runs
- in your cluster config, use `copy_environment: true` (with the caveat that the PUN’s environment is very different from regular shell sessions)
- add `SLURM_EXPORT_ENV=ALL` to the Job Composer env (`/etc/ood/config/apps/myjobs/env`)
- add an initializer to the Job Composer app
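For that third bullet, as I understand it the env file is a plain dotenv-style file read by the Job Composer app; mine is essentially just the one line below (the path comes from the bullet above):

```bash
# /etc/ood/config/apps/myjobs/env
# Set in the Job Composer app's environment, and therefore inherited by
# the sbatch submissions it makes.
SLURM_EXPORT_ENV=ALL
```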
I want to emphasize now that I really, really want to focus only on the Job Composer (aka the “myjobs” app), and not on any other apps. (I know that Batch Connect offers a copy_environment option, but as far as I know, this is not available for Job Composer.) I mention this because almost all discussion I can find on this srun …thing… focuses on batch-connect apps, and my priority is the Job Composer.
Question 1:
Are there any additional approaches, as of OOD v4, for “fixing” the Job Composer? (My apologies for that phrasing, I know it’s an architectural choice. But for my purposes…) I’m especially asking because I only came across the third option above in a Discourse post; it wasn’t suggested in any documentation I could find, so maybe there are others?
Question 2:
Does the cluster config `copy_environment` work for the Job Composer? In my testing, I have never gotten it to work.
I’m purposely not posting this in the Get Help section, but for the curious, here’s my config:
$ cat monsoon.yml
# /etc/ood/config/clusters.d/monsoon.yml
---
v2:
  metadata:
    title: "Monsoon"
  login:
    host: "ondemand-dev.hpc.nau.edu"
  job:
    adapter: "slurm"
    cluster: "monsoon"
    bin: "/usr/bin"
    conf: "/etc/slurm/slurm.conf"
    copy_environment: true
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/usr/bin/websockify"
        %s
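(For what it’s worth, when testing Question 2 I’ve been submitting a throwaway script like the sketch below through the Job Composer, comparing the batch step’s environment with what an srun step sees; the names are placeholders.)

```bash
#!/bin/bash
#SBATCH --job-name=env-check
#SBATCH --time=00:02:00

# What did the submission hand the job?
echo "SLURM_EXPORT_ENV=${SLURM_EXPORT_ENV:-<unset>}"

# Compare the batch step's environment with an srun step's environment.
env | sort > batch_env.txt
srun --ntasks=1 env | sort > srun_env.txt
diff batch_env.txt srun_env.txt || true
```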
Thank you!