I guess this is a continuing issue, and I’m not entirely sure how to approach debugging it. As you know, I posted a question about a Jupyter app failing to start while waiting for a server port, which we resolved by clearing out the web browser cache.
My RStudio Server, which was working just fine last week, has stopped working and keeps timing out waiting for a port to open. Our main cluster is split into two main node types: epyc and gpu. This is only happening on my “epyc” nodes; it works fine on the GPU nodes. It’s the exact same script being run, so I’m confused.
I’m going to start by trying the over-simplified. What happens if you restart the epyc node?
Not possible; every node (we have 20 epyc nodes) is running jobs.
I’m not showing any stale or zombie processes holding ports open across the cluster.
I’ve done a search within Discourse. There are some articles that may help you.
Can you please take a look at the results of this search, and let us know if any of these solutions help?
Been reading those for days.
I actually managed to solve it. I took a look at all the output.log files and found it was trying to start up on one particular host, epyc012. That host is running a few dozen 2-cpu jobs. I added a --exclude= and jobs are running again.
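A minimal sketch of that kind of log check, for anyone who wants to repeat it. It assumes each session directory contains an `output.log` and that the compute host name appears somewhere in the log in a form like `epyc012`; the directory layout, the regex, and the function name `count_hosts` are illustrative assumptions, not what was actually run here.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed host naming scheme (e.g. epyc012, gpu03); adjust to your cluster.
HOST_PATTERN = re.compile(r"\b(?:epyc|gpu)\d+\b")


def count_hosts(log_root, pattern=HOST_PATTERN):
    """Count how many output.log files under log_root mention each host."""
    counts = Counter()
    for log_file in Path(log_root).rglob("output.log"):
        text = log_file.read_text(errors="replace")
        # Count each host once per log so one chatty log does not dominate.
        counts.update(set(pattern.findall(text)))
    return counts
```

Calling `count_hosts` on the directory that holds the failed sessions and looking at `.most_common()` should make a single recurring host stand out. That host name can then be passed to Slurm's `--exclude=` option, as described above.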
Sorry for the false alarm guys.
Very cool. Do you know why it kept hitting the one host?
Hi Gerald, sorry I missed your follow-up question.
In that instance, Slurm just kept trying to schedule the app on the node in question because it had space for it there. Most users were able to start up apps after waiting 30 minutes or so, until Slurm scheduled them on a different node.
It wasn’t limited to one node; that was just the first time I ran into the problem. I’m posting a General Discussion to expand on this a bit. We have some jobs being submitted that are I/O intensive for memory or storage, sometimes both. These cause enough of a delay that port assignments for an interactive app time out. I’ve played with the timeout wait value a bit, but it’s hit and miss.
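To illustrate why a fixed timeout is hit and miss on a loaded node, here is a rough sketch of the general "wait for a port" pattern: poll a TCP port until it accepts a connection or a deadline passes. This is not the portal's actual implementation; the function name `wait_for_port` and its parameters are hypothetical and only show the idea.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: the server is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

If heavy I/O delays the server's startup past `timeout`, the check gives up even though the server would have come up a little later. Raising the value helps only when the delay is shorter than the new limit.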