Last week, our VAST home file system had issues that brought down the entire cluster. The cluster is back in service; however, Open OnDemand now behaves weirdly and is nonfunctional. I was on vacation at the time, so I am not sure exactly what the system admins did to bring the OOD service back.
Normally, after "Your session is currently starting… Please be patient as this process can take a few minutes.", a job session starts shortly afterward. Now, the session card's buttons will not appear unless I click the hyperlink of the session ID, which just displays the contents of the log directory. Otherwise nothing happens, and the message "Your session is currently starting… Please be patient as this process can take a few minutes." stays in place, waiting endlessly.
We can't put the Open OnDemand service back into production, and users are waiting. Please advise where I need to look. May I request a help session over Zoom?
Sorry for the delay. I'm a bit swamped with meetings, but if you email me directly with a time, I can at least respond with one that could work.
Restarts never hurt. It sounds like your storage systems went down, and I wonder if the web node was ever able to reconnect. It seems to me that you're able to schedule the job and it starts running. The job, however, writes the connection file in that staged directory, and it looks like this file write isn't propagating back to the web node where it can be read.
So I'd suggest bouncing the web node entirely, so that the file writes on the compute nodes can start to propagate back to the web node.
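After the bounce, a quick sanity check from the web node can confirm that /home actually came back healthy before you retry a session. A rough sketch (Python and the paths are just for illustration; adjust HOME_MOUNT to wherever your VAST home is actually mounted):

```python
#!/usr/bin/env python3
"""Quick post-restart sanity check for the /home mount on the web node.

Run as a regular user. Paths are illustrative; adjust HOME_MOUNT to
wherever your VAST share is actually mounted.
"""
import os
import tempfile

HOME_MOUNT = "/home"  # assumption: the VAST home share is mounted here

# 1. Is /home actually a mount point (and not just an empty local dir)?
print(f"{HOME_MOUNT} is a mount point: {os.path.ismount(HOME_MOUNT)}")

# 2. Can we see our own home directory and its ondemand data dir?
home = os.path.expanduser("~")
ondemand_dir = os.path.join(home, "ondemand")
print(f"{home} exists: {os.path.isdir(home)}")
print(f"{ondemand_dir} exists: {os.path.isdir(ondemand_dir)}")

# 3. Can we write and read a small file through the mount?
try:
    with tempfile.NamedTemporaryFile(dir=home, prefix=".ood_mount_check_") as f:
        f.write(b"ok")
        f.flush()
        os.fsync(f.fileno())
        f.seek(0)
        assert f.read() == b"ok"
    print("write/read through the mount: OK")
except OSError as e:
    print(f"write/read through the mount FAILED: {e}")
```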
I'm not by any stretch an expert in network file systems, but my guess is that the file writes happening on the compute nodes aren't being propagated back to the web node.
This is the question I would ask: when the job writes connection.yml, why does this file not show up on the web node?
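One crude way to answer that directly is to take OOD out of the picture entirely and just time how long a file written on a compute node takes to show up on the web node. Something along these lines (the test file path is just a placeholder; start the poller on the web node, then touch the file from a compute node):

```python
#!/usr/bin/env python3
"""Time how long a file written on a compute node takes to appear here.

1. Start this on the web node:  python3 propagation_check.py
2. Right afterward, from a compute node run:  touch ~/ood_propagation_test
The path below is just an example; any file in the shared home works.
The elapsed time is measured from when polling started, so it is a
rough upper bound rather than an exact number.
"""
import os
import time

TEST_FILE = os.path.expanduser("~/ood_propagation_test")  # placeholder test file
TIMEOUT = 300  # give up after 5 minutes

start = time.time()
while time.time() - start < TIMEOUT:
    if os.path.exists(TEST_FILE):
        print(f"file became visible after {time.time() - start:.1f} s of polling")
        break
    time.sleep(1)
else:
    print(f"file still not visible after {TIMEOUT} s -- "
          "writes on the compute nodes are not reaching this mount")
```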
I want to be clear that I do not represent OSC or OnDemand, but I have deployed it a lot and like to try to give at least some help in the forums when possible, so by all means Jeff is the expert here.
OOD does rely on that /home/$USER/ondemand dir for each app session's working environment, so as Jeff suggested, rebooting the server/VM that OOD is running on is always a good first step to make sure your underlying autofs or otherwise mounted /home is sturdy. I manage a lot of clusters, including ones that house /home on VAST, and any sort of latency or being out of sync can really mess with things.
I take it the VAST system is back up and has been checked on all nodes in the cluster? VAST shares can be deployed in various ways, so it would be helpful to know how you folks are doing it. In the simplest form, it can effectively be just an NFS export, but for much better performance you can build their kernel modules against MOFED (if you are an InfiniBand-primary cluster), or against the kernel even if you do not have IB. Are you an IB shop, or just over regular TCP/IP?
A first guess might be that the connection.yml file is created too late to beat the underlying OOD Rails app's need for it, so you are hitting a race condition of sorts, resulting in the interactive app card not having the information it needs to populate the HTML, etc.
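If you're not sure how the share is currently presented, a quick way to compare is to dump the mount entry for /home from both the web node and a compute node. A small sketch (it only assumes /home is the mount point):

```python
#!/usr/bin/env python3
"""Print the filesystem type and mount options for /home on this node.

Run it on both the web node and a compute node and compare the output:
fstype, NFS version, proto=tcp vs. rdma, and attribute-cache options
such as actimeo/noac/lookupcache.
"""
MOUNT_POINT = "/home"  # assumption: the VAST share is mounted here

with open("/proc/self/mounts") as f:
    for line in f:
        device, mountpoint, fstype, options, *_ = line.split()
        if mountpoint == MOUNT_POINT:
            print(f"device : {device}")
            print(f"fstype : {fstype}")
            print(f"options: {options}")
            break
    else:
        print(f"{MOUNT_POINT} does not appear in /proc/self/mounts")
```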
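If you want to put a rough number on that, one option is to watch a session's staging directory from the web node and compare connection.yml's mtime (set when the compute node writes it) with the moment it first becomes visible there. Clock skew between nodes will muddy the result, so treat it as an indicator rather than a measurement; the staging layout assumed below is the default one:

```python
#!/usr/bin/env python3
"""Watch a batch_connect session dir from the web node and report how long
after connection.yml was written (its mtime, set on the compute node) it
first became visible here. Clock skew between nodes will skew the number,
so treat it as a rough indicator only.

Usage: python3 watch_connection.py <session_output_dir>
The default staging layout is assumed, i.e. something under
~/ondemand/data/sys/dashboard/batch_connect/.../output/<session_id>/.
"""
import os
import sys
import time

if len(sys.argv) != 2:
    sys.exit("usage: watch_connection.py <session_output_dir>")

session_dir = sys.argv[1]
target = os.path.join(session_dir, "connection.yml")

# Poll until the file is visible from this node.
while not os.path.exists(target):
    time.sleep(0.5)

seen_at = time.time()
written_at = os.stat(target).st_mtime
print(f"connection.yml visible at          : {time.ctime(seen_at)}")
print(f"mtime (written on compute node)    : {time.ctime(written_at)}")
print(f"apparent visibility lag            : {seen_at - written_at:.1f} s")
```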
On behalf of OSC / Open OnDemand, we are extremely grateful you are pitching in to help and welcome any and all assistance you are willing to provide to the community!
Thank you all for sharing your thoughts and ideas!
Once you pointed to connection.yml, I noticed that for some jobs the file was either missing from the staged dir, which caused the jobs to be terminated right away, or it was generated but could not be propagated back to the login node, causing the "stuck" behavior. We don't have a dedicated server as the OOD login (web) node. Meanwhile, some jobs run as expected.
The system admins only rebooted some of the compute nodes to bring the cluster back online sooner, which probably explains the weird OOD behavior.
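In case it helps anyone else, a quick way to spot-check which session directories are missing the file (assuming the default staging location under ~/ondemand) is something like:

```python
#!/usr/bin/env python3
"""List batch_connect session output dirs and whether each one contains a
connection.yml. The staging root below is the default layout; adjust it
if your install relocates the OOD data directory.
"""
import glob
import os

STAGE_ROOT = os.path.expanduser("~/ondemand/data/sys/dashboard/batch_connect")

# Session output dirs look like .../<app>/output/<session_id>/
pattern = os.path.join(STAGE_ROOT, "**", "output", "*")
for session_dir in sorted(glob.glob(pattern, recursive=True)):
    if not os.path.isdir(session_dir):
        continue
    has_conn = os.path.exists(os.path.join(session_dir, "connection.yml"))
    status = "ok     " if has_conn else "MISSING"
    print(f"{status} {session_dir}")
```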