Hey guys,
I guess this is a continuing issue and I’m not entirely sure how to approach debugging it. As you know, I previously posted a question about the Jupyter app failing to start while waiting for a server port, which we were able to resolve by clearing the web browser cache.
My RStudio Server, which was working just fine last week, has stopped working and keeps timing out waiting for a port to open. Our main cluster is split into two main node types: epyc and gpu. This is only happening on the epyc nodes; it works fine on the gpu nodes. It’s the exact same script being run, so I’m confused.
I actually managed to solve it. I took a look at all the output.log files and found that the app kept trying to start up on one particular host, epyc012, which is running a few dozen 2-CPU jobs. I added an --exclude= for that host and jobs are running again.
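For anyone else who hits this, the flag is just the standard Slurm --exclude option. A minimal sketch (the script name below is a placeholder, and in our case the argument actually gets passed through the interactive app’s submit options rather than a hand-written sbatch line):

```bash
# Minimal sketch -- the script name is a placeholder; in practice we pass the
# flag through the interactive app's submit arguments.
sbatch --exclude=epyc012 rstudio_server.sbatch

# Equivalent directive inside the batch script (goes with the other
# #SBATCH lines at the top):
#   #SBATCH --exclude=epyc012
```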
Hi Gerald, sorry I missed your follow-up question.
In that instance, Slurm just kept trying to schedule the app on the node in question because it still had space for it there. Most users were able to start apps after waiting 30 minutes or so, once Slurm scheduled them on a different node.
It wasn’t limited to one node; that was just the first time I ran into the problem. I’m posting in General Discussion to expand on this a bit. We have some jobs being submitted that are I/O intensive on either memory or storage, sometimes both. These cause enough of a delay that port assignment for an interactive app times out. I’ve played with the timeout wait value a bit, but it’s hit and miss.
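To make the failure mode concrete, the behaviour is roughly the sketch below: something waits a fixed amount of time for the app’s port to come up and gives up if it doesn’t. This is not the actual OnDemand code, just an illustration (the host, port, and timeout values are made up, and it assumes nc is installed); the point is that a heavily I/O-loaded node can easily blow past a fixed wait.

```bash
#!/usr/bin/env bash
# Hypothetical illustration only -- not the real OnDemand logic.
# Poll a host:port until it opens, giving up after a fixed wait.
HOST=epyc012        # example host from the earlier reply
PORT=8787           # placeholder port
TIMEOUT=300         # total seconds to wait before giving up
INTERVAL=5          # seconds between checks

elapsed=0
until nc -z "$HOST" "$PORT" 2>/dev/null; do
  if (( elapsed >= TIMEOUT )); then
    echo "Timed out after ${TIMEOUT}s waiting for ${HOST}:${PORT}" >&2
    exit 1
  fi
  sleep "$INTERVAL"
  elapsed=$(( elapsed + INTERVAL ))
done
echo "${HOST}:${PORT} became reachable after ~${elapsed}s"
```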