Invalid gres with batch_connect

Hi,

I have a Jupyter app based on the example batch_connect approach which works fine, but when I try to make use of a GPU by adding a gres option, it fails at submit time with an invalid gres error:

"sbatch: error: Invalid generic resource (gres) specification"

The job_script_options shows the following, which from a slurm point of view looks like what I’d expect:

"native": [
  "-n",
  "1",
  "-J",
  "OODJupyter",
  "-p",
  "gpu",
  "--gres=gpu:1",
  "--exclusive"
]
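
For reference, that native array should translate into roughly this sbatch invocation (connect.sh is just a placeholder for the generated batch_connect job script):

sbatch -n 1 -J OODJupyter -p gpu --gres=gpu:1 --exclusive connect.sh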

I’ve tried various combinations around how to specify the gres option, but to no avail.

A quick Google for OnDemand and the error turns up one page (ondemand gpu request error, Nov 2021 | Ohio Supercomputer Center), but there's no hint at what that problem was (probably unrelated!).

Any assistance would be greatly appreciated!

Nothing is jumping out on the OOD side but I’d be curious to know if SLURM is configured for GPUs.

I think the field you want to ensure is set would be GresTypes=gpu to start.
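
Something along these lines should confirm it (the node name is just a placeholder for one of your GPU nodes):

# confirm the controller knows about the gpu GRES type
scontrol show config | grep -i GresTypes

# confirm a GPU node actually advertises the gres
scontrol show node <gpu-node> | grep -i Gres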

Thanks for following up. I think things are set up correctly. We have GresTypes=gpu set in our slurm config, and a typical node is set with Gres=gpu:tesla:1

A salloc started with a similar --gres=gpu:1, run on the same node/partition, seems to work (output taken from scontrol show job):

   JOB_GRES=gpu:tesla:1
     Nodes=gpu01 CPU_IDs=14 Mem=4000 GRES=gpu:tesla:1(IDX:0)
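
For completeness, the interactive test was along these lines (simplified):

salloc -p gpu --gres=gpu:1
# then, inside the allocation:
scontrol show job $SLURM_JOB_ID | grep -i gres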

Thanks for the info. I'm curious: what happens if you alter that script to fit the pattern that worked for you, so:

...
"--gres=gpu:tesla:1",
...

Unfortunately, the same invalid gres result:

  "native": [
    "-n",
    "1",
    "-J",
    "OODJupyter",
    "-p",
    "gpu",
    "--gres=gpu:tesla:1",
    "--exclusive"
  ]

If you remove OOD from the picture, what are the arguments you'd have to give for just a shell script?

You can see the command we're actually issuing in /var/log/ondemand-nginx/$USER/error.log; search for execve.
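
For example (path assumes the default log location):

# show the most recent sbatch command OOD actually executed
grep execve /var/log/ondemand-nginx/$USER/error.log | tail -n 1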

I'd suggest finding what the command should be just from command-line testing. Also, can you try --gpus-per-node to do the same thing?

Thanks for the info -

The execve line looks correct as far as I can tell:

"execve = [{}, \"sbatch\", \"-D\", \"BLAH\", \"-J\", \"sys/dashboard/dev/jupyterprod\", \"-o\", \"BLAH\", \"-A\", \"viper-standard-user\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"-n\", \"1\", \"-J\", \"OODJupyter\", \"-p\", \"gpu\", \"--gres=gpu:tesla:1\", \"--exclusive\", \"--parsable\"]"

And with --gpus-per-node=1, the same error:

"execve = [{}, \"sbatch\", \"-D\", \"BLAH\", \"-J\", \"sys/dashboard/dev/jupyterprod\", \"-o\", \"BLAH\", \"-A\", \"viper-standard-user\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"-n\", \"1\", \"-J\", \"OODJupyter\", \"-p\", \"gpu\", \"--gpus-per-node=1\", \"--exclusive\", \"--parsable\"]"

So far I've been using a Jupyter batch_connect app, but I thought I'd test with a task submitted through the Job Composer. A basic test job fails there with the same gres error, but the exact same script works when submitted directly with sbatch from the command line (and has the expected gres):

#!/bin/bash
#SBATCH -J testcc
#SBATCH -p gpu
#SBATCH --gpus-per-node=1

sleep 60
echo $HOSTNAME

OK, then I'd suspect the flags --exclusive and --export NONE.

I assume you're submitting the script with just:

sbatch delme.sh

Can you remove the #SBATCH directives and try variations of the command-line arguments?

sbatch --exclusive -p gpu --gres=gpu:tesla:1 delme.sh
sbatch --export NONE --exclusive -p gpu --gres=gpu:tesla:1 delme.sh
sbatch --export NONE -p gpu --gres=gpu:tesla:1 delme.sh
sbatch -p gpu --gres=gpu:tesla:1 delme.sh

Also be sure that you’re submitting from the web host itself (the host that OOD is installed on). You may have a slurm conf mismatch or similar. In any case, it’s just good to replicate the commands on the same system to track down what’s going on.
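
One quick way to do that without actually running anything is sbatch's --test-only flag, run on the OOD host itself; it should reproduce the gres error without queueing a job:

sbatch --test-only -p gpu --gres=gpu:tesla:1 delme.sh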

You are on to something there! Sorry, I should have tested on the same host, but we generally restrict direct user logins on our OOD server, so by force of habit I'd just been submitting the comparison test jobs from another host. But anyway, it does indeed fail from the command line on that host.

I'm running configless slurm, and it looks like, for some reason, gres.conf hasn't made it to the OOD server host. I just need to give it a kick and work out why!
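
For anyone else who hits this, the check and the "kick" on the web host will be something like the following (paths are from our configless setup and may differ elsewhere):

# the configless cache should contain gres.conf alongside slurm.conf
ls -l /run/slurm/conf/

# restart slurmd so the host re-fetches its config files from slurmctld
sudo systemctl restart slurmd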

Thanks so much for the help, hopefully this will help someone else avoid the same mistake in future!
