We’ve pivoted to implementing “configless Slurm”. We recently updated the head nodes to run the limited slurmd, so that the head nodes (as well as the compute nodes) communicate with the controller to get the latest configuration information (slurm.conf and gres.conf).
What is the best practice now for the Cluster configuration entry?
I had not altered the standard Slurm inputs to the /etc/ood/config/clusters.d/.yml
The main question is whether to retain or alter the ‘conf:’ entry. I reviewed the docs and do see that this is an optional field. Does it become problematic when operating Slurm ‘configless’ (which is default since Slurm 20.02, I recently learned)? Are there any reasons to continue to specify the ‘conf’ field in the case of relying on slurmd to communicate to the controller for the config info?
I’m not sure myself as I’m not knowledgeable about SLURM itself and this new change. First I’ve heard about it.
@tdockendorf do you have any pointers on this? I don’t know how that is set up at OSC or whether we use the same setup.
I don’t believe configless is the default, as it still has to be enabled; it is opt-in in slurm.conf (see Slurm Workload Manager - slurm.conf). I think the default you are referring to is that it’s “built in by default” but not enabled by default (see Slurm Workload Manager).
OnDemand just executes Slurm commands such as “sbatch” and “squeue”, so if you omit the “conf” setting for OnDemand, those commands will still find their configs the same way they do for any other user on the node.
FWIW OSC runs configless, but only for our HPC systems. Our web nodes, which run OnDemand, don’t use configless because we need our OnDemand nodes to talk to multiple clusters. So we deploy files like “/etc/slurm/slurm-$cluster.conf” and point each cluster YAML at its specific conf file. That lets us force OnDemand to not use “-M $cluster”, which can cause issues when SlurmDBD is offline or down for maintenance.
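As a rough sketch of that multi-cluster layout, a cluster YAML might look like the following. The title, host, bin path, and the name “cluster_a” are placeholders and not OSC’s real values. Leaving out the “cluster” key is shown on the assumption that this key is what makes the Slurm adapter pass “-M”.

```yaml
# Hypothetical cluster file for one of several clusters served by the same web node.
# All names and paths below are illustrative placeholders.
---
v2:
  metadata:
    title: "Cluster A"
  login:
    host: "login.cluster-a.example.edu"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    # Per-cluster Slurm config deployed on the web node.
    conf: "/etc/slurm/slurm-cluster_a.conf"
    # No "cluster" key here (assumption: setting it is what adds "-M <cluster>").
```

Each additional cluster would get its own YAML file pointing at its own slurm-$cluster.conf.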
But if the conf field is set, does it take precedence over other methods to provide the config info in use by the system? So, can having the conf: present in a cluster configuration but incorrect (any path at all, in the case of ‘configless’ slurm) be disruptive to the launching of interactive apps?
I am now convinced that we should no longer set it in the OOD cluster configuration, and we will stop. I suppose time will tell if we still encounter the errors that arose after our change a couple of days ago to remove the static configuration files (/etc/slurm/slurm.conf) from our login nodes.
If you set conf, it will instruct OnDemand to set the SLURM_CONF environment variable to that path when executing Slurm commands. The docs on configless list the order of how configs are found, so this environment variable, if set to the incorrect path, will prevent the configless default paths from being seen. So if /etc/slurm/slurm.conf no longer applies to your deployment, it will need to be removed from the cluster YAML or updated to point at the configless location in /run.
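For the “remove it” option, a minimal sketch of a cluster YAML for a configless node could look like this. The title, host, and bin path are placeholders. The only point is that there is no “conf” key, so SLURM_CONF is not set and Slurm’s own lookup applies.

```yaml
# Hypothetical cluster file for a node that gets its config via configless slurmd.
# Title, host, and bin path are illustrative placeholders.
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "login.example.edu"
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    # No "conf" key: OnDemand leaves SLURM_CONF unset, so sbatch/squeue
    # locate the configuration the same way they do in a normal login shell.
```

A quick sanity check is to run “squeue” in a shell on the web node with SLURM_CONF unset. If that works, the same commands should work when launched by OnDemand with this file.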
Thanks, Trey, for explaining how OnDemand implements the ‘conf’ setting. And for the interpretation of the hierarchy provided in the Slurm notes on configless. I’ll discuss the implementation of configless within our team, now with a better understanding of the role played by OnDemand.