Hey guys,
I guess this is a continuing issue and I’m not entirely sure how to approach debugging it. As you know, I previously posted a question about the Jupyter app failing to start while waiting for a server port, which we were able to resolve by clearing the web browser cache.
My RStudio Server, which was working just fine last week, has stopped working and keeps timing out waiting for a port to open. Our main cluster is split into two main node types: epyc and gpu. This is only happening on the epyc nodes; it works fine on the gpu nodes. It’s the exact same script being run, so I’m confused.
I actually managed to solve it. I took a look at all the output.log files and found that the app kept trying to start up on one particular host, epyc012, which is running a few dozen 2-CPU jobs. I added an --exclude= for that host and jobs are running again.
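For anyone else who hits this, the flag is just the standard Slurm --exclude option. A minimal sketch (the script name below is a placeholder, and in our case the argument actually gets passed through the interactive app’s submit options rather than a hand-written sbatch line):

```bash
# Minimal sketch -- the script name is a placeholder; in practice we pass the
# flag through the interactive app's submit arguments.
sbatch --exclude=epyc012 rstudio_server.sbatch

# Equivalent directive inside the batch script (goes with the other
# #SBATCH lines at the top):
#   #SBATCH --exclude=epyc012
```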
Hi Gerald, sorry I missed your follow-up question.
In that instance, Slurm just kept trying to schedule the app on the node in question because it still had space for it there. Most users were able to start apps after waiting 30 minutes or so, once Slurm scheduled them on a different node.
It wasn’t limited to one node; that was just the first time I ran into the problem. I’m posting in General Discussion to expand on this a bit. We have some jobs being submitted that are I/O intensive on either memory or storage, sometimes both. These cause enough of a delay that port assignment for an interactive app times out. I’ve played with the timeout wait value a bit, but it’s hit and miss.
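To make the failure mode concrete, the behaviour is roughly the sketch below: something waits a fixed amount of time for the app’s port to come up and gives up if it doesn’t. This is not the actual OnDemand code, just an illustration (the host, port, and timeout values are made up, and it assumes nc is installed); the point is that a heavily I/O-loaded node can easily blow past a fixed wait.

```bash
#!/usr/bin/env bash
# Hypothetical illustration only -- not the real OnDemand logic.
# Poll a host:port until it opens, giving up after a fixed wait.
HOST=epyc012        # example host from the earlier reply
PORT=8787           # placeholder port
TIMEOUT=300         # total seconds to wait before giving up
INTERVAL=5          # seconds between checks

elapsed=0
until nc -z "$HOST" "$PORT" 2>/dev/null; do
  if (( elapsed >= TIMEOUT )); then
    echo "Timed out after ${TIMEOUT}s waiting for ${HOST}:${PORT}" >&2
    exit 1
  fi
  sleep "$INTERVAL"
  elapsed=$(( elapsed + INTERVAL ))
done
echo "${HOST}:${PORT} became reachable after ~${elapsed}s"
```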