Last week, our VAST home file system had issues that brought down the entire cluster. The cluster is back in service; however, Open OnDemand now behaves weirdly and is nonfunctional. I was on vacation at the time, so I am not sure exactly what the system admins did to bring the OOD service back.
Normally, after "Your session is currently starting… Please be patient as this process can take a few minutes.", a job session starts shortly afterward. Now, the session card's buttons will not appear unless I click the hyperlink of the session ID, which just displays the contents of the log directory. Otherwise nothing happens, and the message "Your session is currently starting… Please be patient as this process can take a few minutes." stays in place, waiting endlessly.
We can't put the Open OnDemand service back into production, and users are waiting. Please advise where I need to look. May I request a help session over Zoom?
Sorry for the delay. I'm a bit swamped with meetings, but if you email me directly with a time, I can at least respond with one that could work.
Restarts never hurt. It sounds like your storage systems went down, and I wonder if the web node was ever able to reconnect. It seems to me that you're able to schedule the job and it starts running. The job, however, writes the connection file in that staged directory, and it looks like this file write isn't propagating back to the web node where it can be read.
So I'd suggest bouncing the web node entirely, so that the file writes on the compute nodes can start to propagate back to the web node.
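After the bounce, a quick sanity check from the web node can confirm that /home actually came back healthy before you retry a session. A rough sketch (Python and the paths are just for illustration; adjust HOME_MOUNT to wherever your VAST home is actually mounted):

```python
#!/usr/bin/env python3
"""Quick post-restart sanity check for the /home mount on the web node.

Run as a regular user. Paths are illustrative; adjust HOME_MOUNT to
wherever your VAST share is actually mounted.
"""
import os
import tempfile

HOME_MOUNT = "/home"  # assumption: the VAST home share is mounted here

# 1. Is /home actually a mount point (and not just an empty local dir)?
print(f"{HOME_MOUNT} is a mount point: {os.path.ismount(HOME_MOUNT)}")

# 2. Can we see our own home directory and its ondemand data dir?
home = os.path.expanduser("~")
ondemand_dir = os.path.join(home, "ondemand")
print(f"{home} exists: {os.path.isdir(home)}")
print(f"{ondemand_dir} exists: {os.path.isdir(ondemand_dir)}")

# 3. Can we write and read a small file through the mount?
try:
    with tempfile.NamedTemporaryFile(dir=home, prefix=".ood_mount_check_") as f:
        f.write(b"ok")
        f.flush()
        os.fsync(f.fileno())
        f.seek(0)
        assert f.read() == b"ok"
    print("write/read through the mount: OK")
except OSError as e:
    print(f"write/read through the mount FAILED: {e}")
```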
I'm not by any stretch an expert in network file systems, but my guess is that the file writes happening on the compute nodes aren't being propagated back to the web node.
This is the question I would ask: when the job writes connection.yml, why does this file not show up on the web node?
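One crude way to answer that directly is to take OOD out of the picture entirely and just time how long a file written on a compute node takes to show up on the web node. Something along these lines (the test file path is just a placeholder; start the poller on the web node, then touch the file from a compute node):

```python
#!/usr/bin/env python3
"""Time how long a file written on a compute node takes to appear here.

1. Start this on the web node:  python3 propagation_check.py
2. Right afterward, from a compute node run:  touch ~/ood_propagation_test
The path below is just an example; any file in the shared home works.
The elapsed time is measured from when polling started, so it is a
rough upper bound rather than an exact number.
"""
import os
import time

TEST_FILE = os.path.expanduser("~/ood_propagation_test")  # placeholder test file
TIMEOUT = 300  # give up after 5 minutes

start = time.time()
while time.time() - start < TIMEOUT:
    if os.path.exists(TEST_FILE):
        print(f"file became visible after {time.time() - start:.1f} s of polling")
        break
    time.sleep(1)
else:
    print(f"file still not visible after {TIMEOUT} s -- "
          "writes on the compute nodes are not reaching this mount")
```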
I want to be clear that I do not represent OSC or OnDemand, but I have deployed it a lot and like to try to give at least some help in the forums when possible, so by all means Jeff is the expert here.
OOD does rely on that /home/$USER/ondemand dir for each app session's working environment, so as Jeff suggested, rebooting the server/VM that OOD is running on is always a good first step to make sure your underlying autofs or otherwise mounted /home is sturdy. I manage a lot of clusters, including ones that house /home on VAST, and any sort of latency or being out of sync can really mess with things.
I take it the VAST system is back up and has been checked on all nodes in the cluster? VAST shares can be deployed in various ways, so it would be helpful to know how you folks are doing it. In the simplest form, it can effectively be just an NFS export, but for much better performance you can build their kernel modules against MOFED (if you are an InfiniBand-primary cluster), or against the kernel even if you do not have IB. Are you an IB shop, or just over regular TCP/IP?
A first guess might be that the connection.yml file is created too late to beat the underlying OOD Rails app's need for it, so you are hitting a race condition of sorts, resulting in the interactive app card not having the information it needs to populate the HTML, etc.
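If you're not sure how the share is currently presented, a quick way to compare is to dump the mount entry for /home from both the web node and a compute node. A small sketch (it only assumes /home is the mount point):

```python
#!/usr/bin/env python3
"""Print the filesystem type and mount options for /home on this node.

Run it on both the web node and a compute node and compare the output:
fstype, NFS version, proto=tcp vs. rdma, and attribute-cache options
such as actimeo/noac/lookupcache.
"""
MOUNT_POINT = "/home"  # assumption: the VAST share is mounted here

with open("/proc/self/mounts") as f:
    for line in f:
        device, mountpoint, fstype, options, *_ = line.split()
        if mountpoint == MOUNT_POINT:
            print(f"device : {device}")
            print(f"fstype : {fstype}")
            print(f"options: {options}")
            break
    else:
        print(f"{MOUNT_POINT} does not appear in /proc/self/mounts")
```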
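If you want to put a rough number on that, one option is to watch a session's staging directory from the web node and compare connection.yml's mtime (set when the compute node writes it) with the moment it first becomes visible there. Clock skew between nodes will muddy the result, so treat it as an indicator rather than a measurement; the staging layout assumed below is the default one:

```python
#!/usr/bin/env python3
"""Watch a batch_connect session dir from the web node and report how long
after connection.yml was written (its mtime, set on the compute node) it
first became visible here. Clock skew between nodes will skew the number,
so treat it as a rough indicator only.

Usage: python3 watch_connection.py <session_output_dir>
The default staging layout is assumed, i.e. something under
~/ondemand/data/sys/dashboard/batch_connect/.../output/<session_id>/.
"""
import os
import sys
import time

if len(sys.argv) != 2:
    sys.exit("usage: watch_connection.py <session_output_dir>")

session_dir = sys.argv[1]
target = os.path.join(session_dir, "connection.yml")

# Poll until the file is visible from this node.
while not os.path.exists(target):
    time.sleep(0.5)

seen_at = time.time()
written_at = os.stat(target).st_mtime
print(f"connection.yml visible at          : {time.ctime(seen_at)}")
print(f"mtime (written on compute node)    : {time.ctime(written_at)}")
print(f"apparent visibility lag            : {seen_at - written_at:.1f} s")
```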
On behalf of OSC / Open OnDemand, we are extremely grateful you are pitching in to help and welcome any and all assistance you are willing to provide to the community!
Thank you all for sharing your thoughts and ideas!
Once you pointed to connection.yml, I noticed that for some jobs the file was either missing from the staged dir, which caused the jobs to be terminated right away, or it was generated but could not be propagated back to the login node, causing the "stuck" behavior. We don't have a dedicated server as the OOD login (web) node. Meanwhile, some jobs run as expected.
The system admins only rebooted some of the compute nodes to bring the cluster back online sooner, which probably explains the weird OOD behavior.
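In case it helps anyone else, a quick way to spot-check which session directories are missing the file (assuming the default staging location under ~/ondemand) is something like:

```python
#!/usr/bin/env python3
"""List batch_connect session output dirs and whether each one contains a
connection.yml. The staging root below is the default layout; adjust it
if your install relocates the OOD data directory.
"""
import glob
import os

STAGE_ROOT = os.path.expanduser("~/ondemand/data/sys/dashboard/batch_connect")

# Session output dirs look like .../<app>/output/<session_id>/
pattern = os.path.join(STAGE_ROOT, "**", "output", "*")
for session_dir in sorted(glob.glob(pattern, recursive=True)):
    if not os.path.isdir(session_dir):
        continue
    has_conn = os.path.exists(os.path.join(session_dir, "connection.yml"))
    status = "ok     " if has_conn else "MISSING"
    print(f"{status} {session_dir}")
```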