I can see cores, nodes, and GPUs, but would like a bar to show used memory. Is it possible to add this?
Like total memory being used across the entire cluster? The information in this panel is geared mostly towards end users trying to use the system, not system admins trying to gauge system utilization. With that said, I'm not really sure what displaying a percentage of cluster memory would tell the end user about the availability to run their job.
We have a custom app where the user puts in the desired amount of memory to use. If they can see what is currently being used and what is currently available, it will help them.
I'm a little confused. Just to be clear, you are talking about the "System Status" app that lives at "pun/sys/dashboard/system-status"?
Maybe you can help walk us through a theoretical example? Let's say, for example, you have a cluster of 10 nodes, each with 40 cores and 1 TB of memory (let's ignore GPUs for now).
Right now (with the way the app works), if no jobs are running on the cluster, the Cluster Status view shows 10 nodes available, 400 cores available.
If, say, somebody is running a job that uses 2 full nodes, it would show 8 nodes available, 320 cores available.
It sounds like what you are asking for is that in the first (no jobs) case you would want it to show 10TB of memory available, and in the second case you would want it to show 8TB of memory available?
Is that indeed what you are asking for?
If so, that seems a bit esoteric to me, since usually clients want to know the available memory on a single node, not across the entire cluster? But maybe I'm missing a use case here?
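To make the arithmetic of that hypothetical concrete, here is a tiny sketch; the numbers are the made-up ones from the example above, not from any real cluster.

```python
# Illustrative only: the made-up 10-node cluster from the example above,
# with 40 cores and 1 TB of memory per node.
NODES, CORES_PER_NODE, TB_PER_NODE = 10, 40, 1

def cluster_status(nodes_in_use: int) -> str:
    """What the Cluster Status view would show if this many full nodes were busy."""
    free = NODES - nodes_in_use
    return (f"{free} nodes available, {free * CORES_PER_NODE} cores available, "
            f"{free * TB_PER_NODE} TB memory available")

print(cluster_status(0))  # 10 nodes available, 400 cores available, 10 TB memory available
print(cluster_status(2))  # 8 nodes available, 320 cores available, 8 TB memory available
```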
So under Cluster > System Status is where I would like it to show available memory. I know it's kind of redundant, but my users are requesting that it show total memory and memory being used. If I can display the available memory on each node, that will also work. Essentially, from my understanding, this will give them something to reference when they set up a job. So if the cluster has 4 GB available, they know there is at least that much to run a job right now without waiting in the queue.
Sorry, but I'm still a little confused, and you seem to be switching between terminology quite a bit here. I can't imagine a situation where a "cluster" would have only, say, 4 GB available, unless that "cluster" is in reality just one or two nodes? And even in that situation, how would knowing that help a client make changes to a job to prevent waiting in the queue?
For example, below is a screenshot of the current memory usage on one of OSC's clusters. It's showing that 24.3 TiB are in use out of 269 TiB. Is this the type of thing you are asking for?
It would really help if you could provide some more specific examples / use cases for us. Note, the System Status app only works with Slurm, so the ideal thing would be if you could give us some specific Slurm parameters or sinfo / scontrol commands to execute that return the specific info you'd like displayed.
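For concreteness, here is a rough sketch of the sort of data that could be pulled, assuming sinfo is on the PATH and the cluster is configured to track memory as a consumable resource (otherwise AllocMem will simply report 0). The summing and formatting below are purely illustrative, not how the System Status app currently works.

```python
import subprocess

# Per-node configured memory and memory currently allocated to jobs, in MB.
out = subprocess.run(
    ["sinfo", "-N", "--noheader", "-O", "NodeHost:20,Memory:12,AllocMem:12"],
    capture_output=True, text=True, check=True,
).stdout

seen, total_mb, alloc_mb = set(), 0, 0
for line in out.splitlines():
    fields = line.split()
    if len(fields) != 3:
        continue
    node, mem, alloc = fields
    if node in seen:          # with -N a node is listed once per partition
        continue
    seen.add(node)
    total_mb += int(mem)      # memory configured for the node
    alloc_mb += int(alloc)    # memory Slurm has allocated to running jobs

print(f"{alloc_mb / 1024:.1f} GB used of {total_mb / 1024:.1f} GB total "
      f"({(total_mb - alloc_mb) / 1024:.1f} GB available to new jobs)")
```

`scontrol show node` exposes the same per-node numbers as RealMemory and AllocMem, if that is easier for you to describe.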
Yes, the 4 GB was just an example. I just need a simple way for a user to view this type of stat.
