Invalid gres with batch_connect

Hi,

I have a Jupyter app based on the example batch_connect approach which works fine, but when I try to make use of a GPU by adding a gres option, it fails at submit time with an invalid gres error:

"sbatch: error: Invalid generic resource (gres) specification"

The job_script_options shows the following, which from a slurm point of view looks like what I’d expect:

"native": [
  "-n",
  "1",
  "-J",
  "OODJupyter",
  "-p",
  "gpu",
  "--gres=gpu:1",
  "--exclusive"
]
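
For reference, that native array should translate into roughly this sbatch invocation (connect.sh is just a placeholder for the generated batch_connect job script):

sbatch -n 1 -J OODJupyter -p gpu --gres=gpu:1 --exclusive connect.sh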

I’ve tried various combinations around how to specify the gres option, but to no avail.

A quick Google for OnDemand and the error turns up one page (ondemand gpu request error, Nov 2021 | Ohio Supercomputer Center), but there's no hint at what that problem was (probably unrelated!).

Any assistance would be greatly appreciated!

Nothing is jumping out on the OOD side but I’d be curious to know if SLURM is configured for GPUs.

I think the field you want to ensure is set would be GresTypes=gpu to start.
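
Something along these lines should confirm it (the node name is just a placeholder for one of your GPU nodes):

# confirm the controller knows about the gpu GRES type
scontrol show config | grep -i GresTypes

# confirm a GPU node actually advertises the gres
scontrol show node <gpu-node> | grep -i Gres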

Thanks for following up. I think things are set up correctly. We have GresTypes=gpu set in our slurm config, and a typical node is set with Gres=gpu:tesla:1

A salloc started with a similar --gres=gpu:1, run on the same node/partition, seems to work (output taken from scontrol show job):

   JOB_GRES=gpu:tesla:1
     Nodes=gpu01 CPU_IDs=14 Mem=4000 GRES=gpu:tesla:1(IDX:0)
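
For completeness, the interactive test was along these lines (simplified):

salloc -p gpu --gres=gpu:1
# then, inside the allocation:
scontrol show job $SLURM_JOB_ID | grep -i gres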

Thanks for the info. I'm curious: what happens if you alter that script to fit the pattern that worked for you, so:

...
"--gres=gpu:tesla:1",
...

Unfortunately, the same invalid gres result:

  "native": [
    "-n",
    "1",
    "-J",
    "OODJupyter",
    "-p",
    "gpu",
    "--gres=gpu:tesla:1",
    "--exclusive"
  ]

If you remove OOD from the picture, what are the arguments you'd have to give for just a shell script?

You can see the command we're actually issuing in /var/log/ondemand-nginx/$USER/error.log; search for execve.
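
For example (path assumes the default log location):

# show the most recent sbatch command OOD actually executed
grep execve /var/log/ondemand-nginx/$USER/error.log | tail -n 1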

I'd suggest finding what the command should be just from command-line testing. Also, can you try --gpus-per-node to do the same thing?

Thanks for the info -

The execve line looks correct as far as I can tell:

"execve = [{}, \"sbatch\", \"-D\", \"BLAH\", \"-J\", \"sys/dashboard/dev/jupyterprod\", \"-o\", \"BLAH\", \"-A\", \"viper-standard-user\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"-n\", \"1\", \"-J\", \"OODJupyter\", \"-p\", \"gpu\", \"--gres=gpu:tesla:1\", \"--exclusive\", \"--parsable\"]"

And with --gpus-per-node=1, the same error:

"execve = [{}, \"sbatch\", \"-D\", \"BLAH\", \"-J\", \"sys/dashboard/dev/jupyterprod\", \"-o\", \"BLAH\", \"-A\", \"viper-standard-user\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"-n\", \"1\", \"-J\", \"OODJupyter\", \"-p\", \"gpu\", \"--gpus-per-node=1\", \"--exclusive\", \"--parsable\"]"

So far I've been using a Jupyter batch_connect app, but I thought I'd test with a task submitted through the Job Composer. A basic test job fails there with the same gres error, but the exact same script works when submitted directly with sbatch from the command line (and has the expected gres):

#!/bin/bash
#SBATCH -J testcc
#SBATCH -p gpu
#SBATCH --gpus-per-node=1

sleep 60
echo $HOSTNAME

OK, then I'd suspect the flags --exclusive and --export NONE.

I assume you're submitting the script with just:

sbatch delme.sh

Can you remove the #SBATCH directives and try variations of the command-line arguments?

sbatch --exclusive -p gpu --gres=gpu:tesla:1 delme.sh
sbatch --export NONE --exclusive -p gpu --gres=gpu:tesla:1 delme.sh
sbatch --export NONE -p gpu --gres=gpu:tesla:1 delme.sh
sbatch -p gpu --gres=gpu:tesla:1 delme.sh

Also be sure that you’re submitting from the web host itself (the host that OOD is installed on). You may have a slurm conf mismatch or similar. In any case, it’s just good to replicate the commands on the same system to track down what’s going on.
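
One quick way to do that without actually running anything is sbatch's --test-only flag, run on the OOD host itself; it should reproduce the gres error without queueing a job:

sbatch --test-only -p gpu --gres=gpu:tesla:1 delme.sh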

You are on to something there! Sorry, I should have tested on the same host, but we generally restrict direct user logins on our OOD server, so by force of habit I'd just been submitting the comparison test jobs from another host. But anyway, it does indeed fail from the command line on that host.

I'm running configless slurm, and it looks like, for some reason, gres.conf hasn't made it to the OOD server host. I just need to give it a kick and work out why!
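
For anyone else who hits this, the check and the "kick" on the web host will be something like the following (paths are from our configless setup and may differ elsewhere):

# the configless cache should contain gres.conf alongside slurm.conf
ls -l /run/slurm/conf/

# restart slurmd so the host re-fetches its config files from slurmctld
sudo systemctl restart slurmd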

Thanks so much for the help, hopefully this will help someone else avoid the same mistake in future!
