Node Administration for Interactive Apps

Hey Guys,
I’m marking this as General Discussion as I’m fishing for administrative experience information.

This post is a continuation of the port timeout issue I posted about a while back. Some workloads being deployed on our cluster are proving to be I/O intensive on memory, storage, or sometimes both. These jobs are hard to predict. They come from multiple groups/accounts, so I can’t isolate them to a reservation or partition. Most come from customers who provided funding for their priority access. The load causes serious delay and lag when starting an interactive app from OOD, and the port assignment times out. I’ve played with the port delay to give it more time, but the results are hit or miss.

Our current solution is creating a new partition dedicated to interactive apps, with dedicated nodes to serve that partition. We are limiting the CPU and memory resources per user. Anybody who needs excessive resources (e.g. 64+ CPUs or 256+ GB) will be directed to our priority partition. This seems to be a good interim solution.

I have a different idea, which will probably prove challenging. Say we start a Jupyter/RStudio instance and it fails to receive a server port assignment because of a timeout. Is there a way to reset and resubmit the job (adding --exclude=<failed node> to the Slurm submit.yml.erb), using the timeout as an indication that the node is not currently usable for an interactive app?

I’m curious whether anybody else has experienced port assignment delays, and I’m definitely interested in hearing about any solutions you have implemented.

SysAdmin, MSU RCI

Thanks for the post! Off the top of my head looking at this, I’d agree that using something in the submit.yml.erb would be the most straightforward way to handle this.

Something like adding a native entry under script that passes --exclude, with some ERB and Ruby to first query the nodes on the back end and then exclude any that fail the queries.
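
As a rough sketch of that idea, the Ruby below renders a submit.yml.erb style template whose native list carries --exclude for any node that fails a check. The node data, the load-per-CPU threshold, and the helper names (unhealthy_nodes, native_args, submit_template, render_submit) are all made up for illustration; a real version would have to get its node data from something like sinfo and would need a check that actually tracks the I/O problem.

```ruby
require "erb"
require "json"
require "yaml"

# Hypothetical health check: flag a node when its load per CPU is above a
# threshold. The hash shape (:name, :load, :cpus) is an assumption; real data
# might be parsed from scheduler output.
def unhealthy_nodes(nodes, max_load_per_cpu: 1.5)
  nodes.select { |n| n[:load] / n[:cpus].to_f > max_load_per_cpu }
       .map { |n| n[:name] }
end

# Build the native argument list; leave --exclude out when nothing failed.
def native_args(excluded)
  excluded.empty? ? [] : ["--exclude=#{excluded.join(',')}"]
end

# Stand-in for the contents of a submit.yml.erb file.
def submit_template
  <<~TEMPLATE
    ---
    script:
      native: <%= native_args(excluded).to_json %>
  TEMPLATE
end

# Render the template with the excluded nodes in scope for ERB.
def render_submit(nodes)
  excluded = unhealthy_nodes(nodes)
  ERB.new(submit_template).result(binding)
end

# Illustrative data only: node01 is overloaded, node02 is not.
sample = [
  { name: "node01", load: 120.0, cpus: 64 },
  { name: "node02", load: 8.0, cpus: 64 }
]
puts render_submit(sample)
```

The array is emitted with to_json so the rendered line stays valid YAML whether the list is empty or not.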

It may be best to have a wrapper script around this to loop the request, and then perhaps even set that in submit.yml.erb under script_wrapper, though I am unsure whether that setting can take a script or only commands.
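
The looping part could look roughly like the sketch below. It is only the retry logic: submit is a hypothetical callable standing in for "submit the job and wait for a port", and nothing here calls a real OOD or Slurm API. Each time the port wait times out, the node that was tried is added to the exclude list for the next attempt.

```ruby
# submit: a callable that takes the current list of excluded node names and
# returns [node_name, port], with port nil when the port wait timed out.
# This interface is an assumption made for the sketch.
def submit_with_retries(submit, max_attempts: 3)
  excluded = []
  max_attempts.times do
    node, port = submit.call(excluded)
    # No node left to try, so give up early.
    break if node.nil?
    # Success: report where the app landed and which nodes were skipped.
    return { node: node, port: port, excluded: excluded } if port
    # Timeout: treat the node as unusable for the next attempt.
    excluded << node unless excluded.include?(node)
  end
  nil
end

# Fake scheduler for illustration: node01 always times out, node02 works.
fake_submit = lambda do |excluded|
  node = (%w[node01 node02] - excluded).first
  [node, node == "node01" ? nil : 8888]
end

p submit_with_retries(fake_submit)
```

Capping the attempts keeps a cluster-wide slowdown from turning into an endless resubmit loop.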

In any case, this can be done with some scripting and ERB; we just have to find a sane route.