Diagnosing and Addressing Slowdowns

Our cluster has 5 nodes, each with 8 GPUs, 64 CPUs (x4 virtual cores), 1 TB of RAM, and 22 TB of local SSD storage. We have a 300 TB HDD NAS for our data, exported over NFS. Users submit jobs using SLURM to a SLURM controller VM that then delegates the job to one of the nodes. The jobs use data from the NAS, must read and write locally, and typically communicate with external servers via the internet. Occasionally we see massive slowdowns of up to 100x in execution time in the middle of a job, or delays of up to 10 minutes before a job starts when many jobs are running concurrently. We have approximately 50 unique users on our cluster.

What could be possible reasons for our slowdowns and how would you try to fix them? Just explain how you would approach the problem and outline the broad steps you would take.

I will have to ping @tdockendorf for any tooling you may wish to use, but I'd start with the USE method, which is an acronym for Utilization, Saturation, and Errors.

https://www.brendangregg.com/usemethod.html
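
As a rough sketch, a first USE pass on a single node can be done with nothing but stock Linux tools (this assumes the sysstat package is installed; adjust intervals to taste):

```bash
# Utilization:
mpstat -P ALL 5 1        # per-CPU busy time
sar -d 5 1               # per-disk utilization (%util column)
# Saturation:
sar -q 5 1               # run-queue length vs. number of CPUs
vmstat 5 2               # swap in/out (si/so) indicates memory pressure
# Errors:
dmesg -T | tail -50      # recent kernel errors (I/O, OOM kills, NIC problems)
```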

Basically, you're going to need some tooling to pull metrics from your system to find what has high utilization and what's saturated. We often promote XDMoD for this, though we use Prometheus as well at OSC with great effect.

But those are large systems that you may not have the staffing time to deploy, so I'd throw out sar and top, which can be effective in a pinch.

https://open.xdmod.org/10.5/index.html
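
The nice thing about sar is that it keeps history, so you can look back at the window when a job slowed down instead of having to catch it live. A minimal sketch, assuming the sysstat collector is already running on the nodes and using a RHEL-style archive path (Debian-family systems keep these under /var/log/sysstat instead):

```bash
sar -u -f /var/log/sa/sa15      # CPU usage for the 15th of the month
sar -r -f /var/log/sa/sa15      # memory and swap
sar -n DEV -f /var/log/sa/sa15  # per-interface network throughput
sar -b -f /var/log/sa/sa15      # overall block I/O rates
```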

Reading Brendan Gregg's webpage a little more, I guess I'd throw perf into the mix as well.
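
Something like the following, run on the affected node while a job is crawling, is usually enough to see whether the time is going to user code, the kernel, or waiting on I/O (just a sketch of the common invocations):

```bash
perf top                         # live view of the hottest kernel/user functions
perf record -a -g -- sleep 30    # 30-second system-wide sample with call graphs
perf report                      # inspect the recording afterwards
```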

I get the sense that 80% of all HPC slowdowns are due to file systems and the network path they take, so I'd look first at whether file I/O on your NFS mounts is saturated.
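
A quick way to check that, as a sketch (the NAS-side command assumes your NAS is a Linux box you can log into rather than a closed appliance):

```bash
# On a compute node, while a job is slow:
nfsiostat 5        # per-mount read/write ops and average RTT/exe latency
nfsstat -c         # client RPC stats; high retrans suggests network or server trouble
# On the NAS itself:
iostat -x 5        # %util near 100 with large await = the spinning disks are saturated
```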

If the slowdown affects all compute nodes, then you have to focus on things they share. Usually the only way a slowdown could be local to the nodes and still hit all of them at the same time is if there is a cron job or systemd timer that runs around the same time on every node and affects performance.
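
A quick sweep for that might look like this; it assumes passwordless SSH to the compute nodes, and the hostnames are placeholders. Compare the timer schedules and cron entries against the times the slowdowns happen.

```bash
for n in node0{1..5}; do
  echo "== $n =="
  ssh "$n" 'systemctl list-timers --all --no-pager | head -15; \
            ls /etc/cron.d /etc/cron.hourly 2>/dev/null; \
            crontab -l 2>/dev/null'
done
```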

If the issue isn't some kind of scheduled task on the compute nodes, then look at storage or network, as those are presumably the things shared between all the nodes.
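
Comparing the shared paths from a node that is slow and one that isn't can narrow it down quickly. Hostnames and interface names below are placeholders:

```bash
ping -c 20 nas-hostname   # latency, jitter, and loss on the path to the NAS
ip -s link show eth0      # interface drops/errors on the node's NIC
sar -n DEV 5 3            # live per-interface throughput; is the NIC near line rate?
```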

I'd recommend Prometheus and Grafana, but those are pretty heavy lifts if you don't already have them. Anything else would require being logged into the compute nodes and the NAS right when the issue happens and running commands like "top" or "ps auxf" to look for load. There are other "top-like" programs such as "htop" and "iftop" to look at processes and network.
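
If being logged in at the right moment is the hard part, a throwaway collector like this, left running on each node (my own sketch, not a polished tool), gives you something to look back at after the fact:

```bash
while true; do
  {
    date
    uptime                               # load averages
    ps aux --sort=-%cpu | head -20       # top CPU consumers at this moment
    cat /proc/pressure/io 2>/dev/null    # PSI stall info on newer kernels
  } >> /tmp/slowdown-watch.log
  sleep 60
done
```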