OnDemand with Slurm-based systems: sbatch?

We are having an issue with submitting jobs from the Job Composer. Not all jobs, just a very select few. When we sbatch a job script from the command line it works fine, but from the Job Composer it fails with an odd error:
slurmstepd: error: execve(): magma: No such file or directory
srun: error: cn10: task 0: Exited with exit code 2

Magma is on the PATH because of a module load in the job script. My question is: does OnDemand use sbatch or some other method to submit the script?


Hello,

When using the Slurm adapter, OnDemand does indeed call sbatch.

May I see the job script you are trying to run?

Hello,

We’ve been able to narrow the issue down to srun not having access to the invoking environment, i.e., running /bin/env within the batch file yields very different results from running srun /bin/env. Most notably, the PATH is empty when using srun.

After submitting this script with the Job Composer, srun’ing env will print only SLURM_* and a handful of related variables, without any PATH at all.

#!/bin/bash

echo "** Using /bin/bash..."
echo "** module load perl..."
module load perl

echo ""; echo "** srun /usr/bin/which perl..."
srun /usr/bin/which perl

echo ""; echo "** /bin/env..."
/bin/env

echo ""; echo ""; echo "** srun /bin/env..."
srun /bin/env
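
To see exactly which variables go missing, the two env dumps can be compared by name. This is a minimal sketch, not something from our actual job: the function name missing_in_step and the file names batch.env and step.env are made up for illustration, and it only looks at lines of the form NAME=value, so multi-line values are handled on a best-effort basis.

```shell
#!/bin/sh
# Print the names of variables that appear in the first env dump
# but not in the second. Both arguments are files holding `env` output.
missing_in_step() {
  batch_names=$(mktemp)
  step_names=$(mktemp)
  # Keep only NAME from NAME=value lines, sorted and de-duplicated.
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u > "$batch_names"
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$2" | cut -d= -f1 | sort -u > "$step_names"
  # comm -23 prints lines that are unique to the first file.
  comm -23 "$batch_names" "$step_names"
  rm -f "$batch_names" "$step_names"
}

# Inside a job script (assumes srun works within the allocation):
#   /bin/env > batch.env
#   srun /bin/env > step.env
#   missing_in_step batch.env step.env
```

If PATH shows up in that list, the job step is not inheriting the batch shell's environment.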

There are a lot of moving parts, and it took us quite a while debugging in various ways to get this going.

Here are some hints. First, in our environment, we try to pass the entire user environment through by default. The behavior of how this works changed some in a recent SLURM release. Currently, we have this as part of ‘submit.yml.erb’:

script:
# does-not-work --> job_environment: "ALL"
native:
    - "--export=ALL"

and this as part of ‘before.sh.erb’:

unset XDG_DATA_DIRS
unset XDG_RUNTIME_DIR
unset SBATCH_EXPORT
unset MAIL
unset PYTHONPATH
unset PYTHONUNBUFFERED

export LOGNAME=$(whoami)
export USERNAME=$LOGNAME
# mysteriously fails? (perm denied?)
# export XDG_RUNTIME_DIR=/run/user/$(id -u)

Probably more hacking as well, but this is what I see immediately.

Hope that helps.

Remember that in YAML formatting, types and indentation are key. Maybe those didn’t work because of that?

script:
   job_environment:
      # job environment is a map of 'key: value'
      FOO: "bar"
      LOGNAME: $USER
   # native is under script
   native:  
   -  "--export=ALL"

Thank you both, michaelkarlcoleman and johrstrom; we’ll look into these suggestions.

I actually just stopped by to add a few additional pieces of information to this puzzle:

First, some background: this issue started for us after a maintenance day in which we upgraded both OOD and Slurm at the same time (to Slurm 19.05.3-2; OOD 1.6.20/1.35.3).

According to the srun docs/manpage:

--export=<environment variables [ALL] | NONE>
Identify which environment variables are propagated to the launched application. By default, all are propagated.

Even though that implies it shouldn’t be necessary, the environment variables become available again when --export=ALL is specified explicitly in each srun statement, as in "srun --export=ALL <command>". (Additionally, all of this holds true whether or not there’s an "#SBATCH --export=all" line at the top of the script.)

(I’ve not thoroughly picked apart our yaml/configs yet since they worked with the previous version but I did validate their YAML.)

OnDemand version: v1.6.20 | Dashboard version: v1.35.3
Slurm version: slurm 19.05.3-2
 ** module load perl ...

 ** srun /usr/bin/which perl ...
/usr/bin/which: no perl in ((null))
srun: error: cn63: task 0: Exited with exit code 1

 ** srun --export=ALL /usr/bin/which perl ...
/packages/perl/5.28.1/bin/perl

 ** /bin/env | grep -vc SLURM ...
 ** /bin/env | grep ^PATH ...
35
PATH=/packages/perl/5.28.1/bin:/packages/git/2.16.3/bin:<SNIP!>

 ** srun --export=ALL /bin/env | grep -vc SLURM ...
 ** srun --export=ALL /bin/env | grep ^PATH ...
40
PATH=/packages/perl/5.28.1/bin:/packages/git/2.16.3/bin:<SNIP!>

 ** srun /bin/env | grep -vc SLURM ...
 ** srun /bin/env | grep ^PATH ...
6
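
Given the output above, one way to avoid repeating the flag on every line is a small shell function near the top of the job script. This is only a sketch: srun_all is a made-up helper name, and it does nothing more than forward its arguments to srun with --export=ALL added.

```shell
#!/bin/sh
# Hypothetical helper: forwards every argument to srun and adds an
# explicit --export=ALL so the job step inherits the batch environment.
srun_all() {
  srun --export=ALL "$@"
}

# Usage inside the job script:
#   module load perl
#   srun_all /usr/bin/which perl
```

This is a workaround inside the job script only; it does not change how the job is submitted.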

We do not have submit.yml.erb or before.sh.erb files. Would it be possible to include the submit section in /etc/ood/config/clusters.d/my_cluster.yml?

I don’t think there is an easy way to modify the Job Composer’s submission arguments, though that does sound like a good idea.

If users are able to successfully submit the job using sbatch from the command line from a login node but the same sbatch is failing from the web node, there is another approach you could take.

You can provide a wrapper script for sbatch that will ssh to the login node and execute sbatch there. Here is an example wrapper script:

and associated overrides https://github.com/puneet336/OOD-1.5_wrappers/tree/master/openondemand/1.5/wrappers/slurm/bin

See https://osc.github.io/ood-documentation/master/installation/resource-manager/slurm.html. You would deploy your wrapper script for sbatch on the webnode, such as /usr/local/bin/sbatch_ssh_wrapper and then modify the cluster config to use this for sbatch:

 job:
   adapter: "slurm"
   cluster: "my_cluster"
   bin: "/path/to/slurm/bin"
   conf: "/path/to/slurm.conf"
+  bin_overrides:
+    sbatch: "/usr/local/bin/sbatch_ssh_wrapper"

If you do this, it affects how all of OnDemand submits to that particular cluster, not just the Job Composer. Here is a relevant Discourse discussion: Question About Passing Environment Variables for PBS Job
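
For the wrapper approach described above, a minimal sketch might look like the following. Several things here are assumptions rather than anything from this thread: the host login.example.org is a placeholder, the sbatch location is the placeholder path from the cluster config, and the helper names shell_quote and remote_sbatch are made up. As far as I know, OnDemand feeds the job script to sbatch on standard input, which ssh passes through to the remote command.

```shell
#!/bin/sh
# Placeholder values; replace with your own login node and sbatch path.
LOGIN_HOST="${LOGIN_HOST:-login.example.org}"
SBATCH_BIN="${SBATCH_BIN:-/path/to/slurm/bin/sbatch}"

# Wrap one argument in single quotes, escaping embedded single quotes,
# so it survives word splitting by the remote shell.
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Rebuild the sbatch command line and run it on the login node.
# stdin is left untouched so a piped job script reaches the remote sbatch.
remote_sbatch() {
  remote_cmd="$SBATCH_BIN"
  for arg in "$@"; do
    remote_cmd="$remote_cmd $(shell_quote "$arg")"
  done
  ssh -q -o BatchMode=yes "$LOGIN_HOST" "$remote_cmd"
}

# As the last line of the deployed wrapper (e.g. /usr/local/bin/sbatch_ssh_wrapper):
#   remote_sbatch "$@"
```

This assumes passwordless (key or host-based) ssh from the web node to the login node for every user, since BatchMode disables password prompts.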

Our sbatch script runs fine on the web node when sbatched from a terminal window (CLI) launched by OOD. It only has environment issues when submitted from the Job Composer. The job I am using for testing used to run just fine from the Job Composer. Something has changed with the latest release.

This has been solved in 1.7, which was released today. It was fixed for all supported schedulers, including Slurm.

To enable it, use the copy_environment attribute in your script element. The given scheduler will set the appropriate flags.

In Slurm’s case, it appears that you also need to set job_environment with at least something.

This configuration in SLURM:

script:
   copy_environment: true
   job_environment:
     FOO: "BAR"

will produce --export=ALL,FOO when submitting the job.

Hi Jeff,

We have this issue in ondemand ver 1.8.18.

Where do we put:

script:
  copy_environment: true

To rectify this issue in ondemand 1.8.18? We don’t appear to have a submit.yml.erb file.

Thanks.

Best,
Chris

Hi and welcome!

submit.yml.erb files live in each batch connect application’s own directory. So for us, when we deploy an app called bc_osc_jupyter to /var/www/ood/apps/sys/bc_osc_jupyter, it has its own submit.yml.erb file in that directory.

Or if it’s for a desktop maybe it’s /etc/ood/config/bc_desktop/submit/submit.yml.erb.

Hi Jeff, thanks for that lightning-speed reply to Chris!

Could I beg you to verify: we want to target submit.yml.erb, not script.yml.erb, right?

# So we'd modify:
/etc/ood/config/myjobs/submit/submit.yml.erb
# rather than:
/etc/ood/config/myjobs/submit/script.yml.erb

And the contents of submit.yml.erb should resemble this:

$ cat /var/www/ood/apps/sys/myjobs/submit.yml.erb
---
script:
  copy_environment: "true"

Is that correct?
And to be fully explicit, we are not modifying template/script.sh.erb, right?

Sadly, if all of the above is correct, we seemingly still are not successfully copying-over the environment. :frowning:

To expand slightly on Chris’ inquiry and provide context to future forumdwellers: we’re trying to fix our setup to allow users to use per-line srun someCommand syntax in batch jobs submitted to Slurm via the Job Composer App. This is probably best explained with some examples:

This job will succeed:

#!/bin/bash
#SBATCH --job-name=test_no_srun
hostname

This job will also succeed:

#!/bin/bash
#SBATCH --job-name=test_export_all
srun --export=all hostname

But this job will fail:

#!/bin/bash
#SBATCH --job-name=test_export_none
srun hostname

…with output suggesting the environment is bone dry:

slurmstepd: error: execve(): hostname: No such file or directory
srun: error: cn69: task 0: Exited with exit code 2

My understanding of srun’s functionality inside sbatch scripts is that while technically optional, it allows us to collect more metadata on the job.

If relevant, this seems to be similar or the same as our former EL6/OOD 1.6 situation where we’d previously patched the ruby directly to regain “srun functionality”.

Thanks for your assistance!
Jason

@jeff.ohrstrom Hi. I’d like to ping this thread if possible. We’re running 2.0.13 and have a similar issue. I’m trying to run an MPI job which fails when launched from the Job Composer app but works fine from the command line. It seems to be the same issue as here: the environment variable SLURM_EXPORT_ENV="NONE" is set when launching from Job Composer. From the command line, SLURM_EXPORT_ENV isn’t set at all, and the whole environment is there.
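
A quick way to check for this from inside a job, and to work around it per script, could look like the sketch below. The function name restore_step_export is made up, and the workaround rests on the srun man page listing SLURM_EXPORT_ENV as the environment equivalent of --export; I have not verified it on every Slurm version.

```shell
#!/bin/sh
# Report how step-level export is configured in the current job, and
# switch NONE back to ALL so later srun calls inherit the environment.
restore_step_export() {
  printf 'SLURM_EXPORT_ENV was: %s\n' "${SLURM_EXPORT_ENV-<unset>}"
  if [ "${SLURM_EXPORT_ENV-}" = "NONE" ]; then
    SLURM_EXPORT_ENV=ALL
    export SLURM_EXPORT_ENV
  fi
}

# Usage near the top of the job script, before the first srun:
#   restore_step_export
#   srun hostname
```

Any value other than NONE is left alone, so a deliberately restricted export list is not overwritten.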

I’ve seen a few topics posted with similar questions, but each time it’s about some app other than Job Composer (like bc_desktop or jupyter). I don’t see a submit.yml.erb in the myjobs app at all, and placing the script options in batch_connect globally (in the cluster.yml file) doesn’t work either.

Any hints on where to modify to get the environments right?

Thanks.

@jeff.ohrstrom @jasonbuechler Looks like I answered my own question. I modified /etc/ood/config/apps/myjobs/env and placed SLURM_EXPORT_ENV="ALL" in it, restarted the PUN, and now the MPI job works. Hopefully that works for other folks as well.
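
For anyone scripting that change, here is a minimal sketch that adds the line only if it is not already present. The helper name ensure_env_line is made up; the file path and the setting are the ones from this post, and it assumes the file, if it exists, ends with a newline.

```shell
#!/bin/sh
# Append a line to a file unless an identical line is already there.
ensure_env_line() {
  env_file=$1
  line=$2
  # -x matches whole lines, -F treats the pattern as a fixed string.
  if ! grep -qxF -- "$line" "$env_file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$env_file"
  fi
}

# Usage (as root on the web node), followed by a PUN restart:
#   ensure_env_line /etc/ood/config/apps/myjobs/env 'SLURM_EXPORT_ENV="ALL"'
```

Running it twice leaves a single copy of the line, so it is safe to put in a provisioning script.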

Thanks.

Hi, apologies for never getting back to you, @jasonbuechler. You’re right that it is submit, not script (I’ve updated the comment).

The issue with putting this into YAML is that it should be a boolean type not a string (so no quotes), though with the job composer having shims to older libraries, the environment variable may be the only way to set it for myjobs.

$ cat /etc/ood/config/bc_desktop/submit/submit.yml.erb
---
script:
  copy_environment: true

There is no single submit.yml.erb for myjobs in /etc or /var/www, so the job composer doesn’t read or know anything about this file.