A user alerted us that the Active Jobs memory reporting (shown when one expands the job info) is not correct. This may be specific to SLURM.
The problem seems to be that the SLURM OOD adapter uses the "squeue -o %m" option for the Memory field, which corresponds to the MinMemoryCPU value in the SLURM config.
In our case, for the non-shared partitions (only one job per node), we have MinMemoryCPU=0, which means the whole node's memory. The expanded Active Jobs info shows the memory as:
Memory 0
which is the MinMemoryCPU value, but the actual job's memory is the whole node (AFAIK not shown by any squeue output option).
For our shared partitions, where multiple jobs can run on a node, we have for example MinMemoryCPU=4000M, that is, we give 4 GB per CPU core, so a job asking for 4 cores gets the correct job memory of 16 GB. Active Jobs shows:
Memory 4000M
which again is the MinMemoryCPU, i.e. the value per single core.
Now, I am not sure what to do about this, as other sites may have different policies. Though, regardless of site SLURM config, MinMemoryCPU=0 means the whole node, so perhaps instead of 0 we could display "Whole node"? For the shared partitions, we could use simple math, MinMemoryCPU*NumCPUs (from squeue -o "%m %C"), though MinMemoryCPU is a string so it would need to be parsed; a rough sketch of that logic is below.
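To illustrate what I mean, here is a minimal sketch of that parsing and math (Python just for illustration, not the actual adapter code). It assumes %m is the per-CPU MinMemoryCPU string in the form shown above (a number with an optional K/M/G/T suffix, megabytes by default) and %C is the allocated CPU count; the function names and output formatting are only placeholders:

import re

# Multipliers to convert SLURM memory suffixes to megabytes (MB is the default unit).
_UNITS_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}

def parse_slurm_mem_mb(mem: str) -> float:
    """Convert a SLURM memory string such as '4000M' or '16G' to megabytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", mem.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"Unrecognized SLURM memory string: {mem!r}")
    value, unit = match.groups()
    return float(value) * _UNITS_MB.get(unit.upper(), 1)

def display_memory(min_memory_cpu: str, num_cpus: int) -> str:
    """Render a friendlier memory value for the Active Jobs panel."""
    per_cpu_mb = parse_slurm_mem_mb(min_memory_cpu)
    if per_cpu_mb == 0:
        # MinMemoryCPU=0 means the job gets all of the node's memory.
        return "Whole node"
    total_mb = per_cpu_mb * num_cpus
    return f"{total_mb / 1024:.1f} GB" if total_mb >= 1024 else f"{total_mb:.0f} MB"

# Examples with the values above:
print(display_memory("4000M", 4))  # 4000 MB per CPU * 4 CPUs = 16000 MB, shown as ~15.6 GB
print(display_memory("0", 16))     # -> "Whole node"

The main point is the special case for 0 and handling the unit suffix before multiplying; the exact presentation is up for discussion.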
I am not sure if this is worth the fuss, but it does confuse our users, since the memory they think they requested is not what is shown in Active Jobs, so it would be nice to give it some thought.
On a similar note, it would be useful to show the maximum memory allocated to the job in My Interactive Sessions, along with the app name, job #, and number of nodes and cores, e.g.:
RStudio server on Notchpeak (1409006) 16 GB memory | 1 node | 4 cores | Running
Please let me know what you think about this.
Thanks,
MC