Hello,
I was able to integrate Grafana usage graphs into our Open OnDemand using the instructions provided in the documentation. The documentation lists only the ‘cpu’ and ‘memory’ panels. Is there a way to get other panels to display? We also capture GPU and GPU memory usage, which we would like to display as well (obviously blank/empty graphs if no GPU was requested … that’s fine). I couldn’t find out whether ‘cpu’ and ‘memory’ are the only panel YAML keys available.
Without looking at the code, I suspect there’s no support for it. However, the panel should be providing a link to the more complete dashboard that users can navigate to.
That said - I did create this ticket to potentially support this:
An OSC colleague commented on the same GitHub ticket that tracking GPU usage from Slurm and getting that into Prometheus is a bit of a challenge. Can you detail how you do it?
I’m using this Prometheus exporter: https://github.com/plazonic/nvidia_gpu_prometheus_exporter
In addition to running the exporter, I needed these scripts in our SLURM prolog.d and epilog.d directories, respectively. Discussion and examples of these can be found in the GitHub repository for the jobstats project from our neighbors down the road at Princeton University: https://github.com/PrincetonUniversity/jobstats
[root@gpu-node001 prolog.d]# cat /etc/slurm/prolog.d/gpustats_helper_prolog.sh
#!/bin/bash
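# Nothing to do for jobs that were not allocated a GPU.
[ -z "$CUDA_VISIBLE_DEVICES" ] && exit 0
DEST=/run/gpustat
[ -e "$DEST" ] || mkdir -m 755 "$DEST"
# Record the job ID and UID in one file per allocated GPU index.
for i in ${GPU_DEVICE_ORDINAL//,/ } ${CUDA_VISIBLE_DEVICES//,/ }; do
    echo "$SLURM_JOB_ID $SLURM_JOB_UID" > "$DEST/$i"
done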
exit 0
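The epilog.d counterpart is not shown in this post. As a rough sketch only, assuming its job is simply to remove the per-GPU marker files that the prolog above writes, it might look something like the following. The `cleanup_gpustat` function name and the `GPUSTAT_DIR` override are illustrative and not taken from the jobstats project; check that repository for the actual script.

```shell
#!/bin/bash
# Hypothetical epilog sketch: remove the marker files written by the prolog.
# GPUSTAT_DIR is an illustrative override; the default matches the prolog's path.
DEST="${GPUSTAT_DIR:-/run/gpustat}"

cleanup_gpustat() {
    local i
    # Nothing to do for jobs that were not allocated a GPU.
    [ -z "${CUDA_VISIBLE_DEVICES}${GPU_DEVICE_ORDINAL}" ] && return 0
    [ -d "$DEST" ] || return 0
    for i in ${GPU_DEVICE_ORDINAL//,/ } ${CUDA_VISIBLE_DEVICES//,/ }; do
        # Only remove a marker that still belongs to this job.
        if [ -f "$DEST/$i" ] && grep -q "^${SLURM_JOB_ID} " "$DEST/$i"; then
            rm -f "$DEST/$i"
        fi
    done
    # Always succeed: a failing Slurm epilog can cause the node to be drained.
    return 0
}

cleanup_gpustat
```

Checking the job ID before deleting is a defensive choice in this sketch, so that a late-running epilog does not remove a marker that a newer job's prolog has already rewritten.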
Here’s what end users see about their job usage when they click the link above the CPU and memory usage thumbnails in OOD.
Thanks for the additional information! I’ve updated the GitHub ticket with it. Yes, I’m guessing you want the GPU Utilization panel to show up in OnDemand.
Yes. GPU Memory Utilization would be nice too … basically mirroring what we can display now for CPU and memory.