System Status in OOD 4.0 not showing GPU utilization

Running into an odd issue. The system status application shows that every GPU on the system is available, when that is not the case. Is there some sort of NVML dependency that needs to be installed for the system status page to update properly?

I’ll have to look up the code to see how GPUs are calculated. IIRC it works off of GRES: the job info needs gpu in the GRES column, or similar.

What does squeue -o %b look like for these jobs?

Here is the output:

N/A
N/A
N/A
N/A
N/A
gres/gpu:8
gres/gpu:8
gres/gpu:8
gres/gpu:8
N/A
gres/gpu:8
N/A
N/A
gres/gpu:8

However, it looks like that’s showing the output of TresPerNode instead of the output of AllocTRES.

I looked up the code and we use %b from squeue to get this information. I’m guessing that means AllocTRES from your comment? (reading the squeue documentation, it doesn’t even mention %b, so I’m not sure where we got that from).

I have no idea where %b comes from either, but %b seems to be outputting the value of TresPerNode. I don’t know if this is by design or not though. For instance, if a user requests two nodes, with 8 GPUs each, %b will still return gres/gpu:8 instead of gres/gpu:16.
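To make the per-node vs. total distinction concrete, here is a minimal sketch of turning a %b value into a job-wide GPU count by multiplying by the node count. The function name `total_gpus` is hypothetical and this is not how OOD does it. It assumes the value looks like `gres/gpu:8` or `gres/gpu:TYPE:8` (a bare `gpu:8` is also accepted), that a missing count means 1, and that any `(...)` suffix can be dropped.

```shell
# Hypothetical helper: total GPUs for one job.
#   $1 = per-node TRES string as printed by %b (e.g. gres/gpu:8, or N/A)
#   $2 = number of nodes in the job
# The body runs in a subshell, so the IFS change does not leak out.
total_gpus() (
    tres=$1
    nodes=$2
    total=0
    IFS=','
    for entry in $tres; do
        entry=${entry%%\(*}            # drop any "(...)" suffix (assumption)
        case $entry in
            gres/gpu*|gpu*) ;;         # keep GPU entries only
            *) continue ;;             # N/A or some other GRES
        esac
        count=${entry##*:}             # text after the last colon
        case $count in
            ''|*[!0-9]*) count=1 ;;    # no numeric count: assume 1 GPU
        esac
        total=$(( total + count ))
    done
    echo $(( total * nodes ))
)

total_gpus 'gres/gpu:8' 2    # two nodes with 8 GPUs each
```

For the two-node example above, this gives 16 rather than the 8 that %b reports on its own.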

@jeff.ohrstrom Just wanted to see if you had a chance to look into this at all?

No, haven’t looked at it yet, but did just file this ticket upstream so I don’t forget entirely.

Sorry to sound like a broken record, but is there any update on this?

No, this isn’t likely to come until the next release - 4.1.

Would you be willing to identify where in the OOD code we can find this instruction? It seems to be a standard exercise to extract the info from the Slurm-generated string, and Slurm can certainly be temperamental with the string format, depending on the specific Slurm version.

I think the bug ticket above has the details.

Thanks, Jeff – Yes, it would seem that the slurm definitions in this file have changed since Slurm 18. My own resources script relies on using sinfo, actually:
gpuframe=$($sinfocmd -h -N -p aisc -O partition,gres,gresused:30)

Since Slurm is Slurm, of course the flags are not 100% consistent between sinfo and squeue. Certainly taking a measured approach to updating the code makes sense.

And I’m learning that the Slurm 24.05.3 man page for squeue does not include ‘%b’, nor does the online ‘Slurm Workload Manager - squeue’ page.

Verifying the comment by Kurt: using ‘-o %b’ does output a column ‘TRES_PER_NODE’. I like the following output because it is regular. The node name isn’t necessary, just interesting while evaluating:

squeue --state=running -o %N,%b | grep -v 'N/A'
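If the goal is a single "GPUs in use" number, a rough sketch along the same lines is to print the node count (%D) next to %b and sum in awk. The function name `sum_running_gpus` is hypothetical. It assumes one GPU entry per job in the form `gres/gpu:N` or `gres/gpu:TYPE:N`, and it uses `|` as the separator because a node list from %N can itself contain commas.

```shell
# Hypothetical helper: reads "NODES|TRES_PER_NODE" lines on stdin and
# prints the total GPU count across all jobs.
sum_running_gpus() {
    awk -F'|' '
        $2 ~ /gpu/ {
            n = $2
            sub(/\(.*/, "", n)            # drop any "(...)" suffix (assumption)
            sub(/.*:/, "", n)             # keep the text after the last colon
            if (n !~ /^[0-9]+$/) n = 1    # no numeric count: assume 1 GPU
            total += n * $1               # per-node count times node count
        }
        END { print total + 0 }
    '
}

# On a cluster, the input would come from something like:
#   squeue -h --state=running -o '%D|%b' | sum_running_gpus
# Canned sample input, so the sketch can be tried without Slurm:
printf '2|gres/gpu:8\n1|N/A\n1|gres/gpu:8\n' | sum_running_gpus
```

Jobs that show N/A are skipped by the `/gpu/` match, so the separate `grep -v` is not needed here.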

Is there a way to disable the system status link under the clusters drop-down on the OOD dashboard until the fix is pushed?

Hi and welcome!

Yes just follow these instructions:
