Hey guys,
I’m marking this as General Discussion since I’m fishing for other admins’ experiences.
This post is a continuation of the port timeouts I posted about a while back. Some workloads being deployed on our cluster are proving to be I/O-intensive on memory or storage, sometimes both. These jobs are hard to predict, and they come from multiple groups/accounts, so I can’t isolate them to a reservation or partition. Most come from customers who funded their priority access. This causes serious delay and lag when starting an interactive app from OOD: the port assignment times out. I’ve played with the port delay to give it more time, but it has proven hit or miss.
Our current solution is to create a new partition dedicated to interactive apps, with dedicated nodes serving it. We are limiting the CPU and memory resources per user; anybody who needs excessive resources (e.g. 64+ CPUs or 256+ GB) will be directed to our priority partition. This seems like a good interim solution; a rough sketch of the setup is below.
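To make that concrete, here is a minimal sketch of the kind of Slurm setup I mean. The node list, QOS name, and limit values are placeholders, not our actual config:

```
# slurm.conf -- dedicated interactive partition (node names are placeholders)
PartitionName=interactive Nodes=node[01-04] Default=NO MaxTime=08:00:00 QOS=interactive State=UP
```

```
# Per-user caps enforced through the partition QOS (values are examples)
sacctmgr add qos interactive
sacctmgr modify qos interactive set MaxTRESPerUser=cpu=64,mem=256G
```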
I have a different idea, which should prove challenging. Say we start a Jupyter/RStudio instance and it fails to receive a server port assignment before the timeout. Is there a way to reset and resubmit the job (adding --exclude="failed node" to the Slurm submit.yml.erb), using the timeout as an indication that the node is not currently usable for an interactive app? Something like the sketch below is what I imagine for the exclusion half.
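As far as I know there is no built-in resubmit-on-timeout hook in OOD, so this sketch only covers passing the exclusion; `exclude_nodes` is a hypothetical attribute that the form (or whatever retry logic tracks the failed node) would populate:

```yaml
# submit.yml.erb -- hypothetical sketch; `exclude_nodes` is a made-up
# attribute supplied by the form or by external retry logic
---
script:
  native:
    <%- unless exclude_nodes.to_s.strip.empty? -%>
    - "--exclude=<%= exclude_nodes %>"
    <%- end -%>
```

The resubmission itself is the part I don’t see a clean hook for, which is really what I’m asking about.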
I’m curious whether anybody else has experienced port assignment delays, and I’m definitely interested in hearing about any solutions you have implemented.
Kenny
SysAdmin, MSU RCI