Last week we encountered an issue with our OnDemand server where each user who logged in would get the following error:
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request
Reason: Error reading from remote server
We aren’t sure exactly what may have triggered it. This is what we see in the logs:
[Mon May 22 11:07:42.938803 2023] [proxy:error] [pid 3295] [client IP:52466] AH00898: Error reading from remote server returned by /pun/sys/dashboard, referer: https://adfs.edu/
[Mon May 22 11:07:42.938716 2023] [proxy_http:error] [pid 3295] (70007)The timeout specified has expired: [client IP] AH01102: error reading status line from remote server httpd-UDS:0, referer: https://adfs.edu/
This is what we have in ood_portal.yml other than the servername and ssl:
Is there a way we can extend this timeout? What might be the cause of this? We don’t see any other issues in the logs. In an effort to get back up and running we restored from a backup, but we are having some issues with that restored instance as well, so resolving the original problem would be ideal. Thank you!
I’d check in /var/log/ondemand-nginx/$USER/error.log - it appears that you’ve authenticated and we’re trying to start the PUN and can’t communicate with it. It’s a 60 second timeout, so it should be enough.
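For reference, a quick way to look at that per-user log is something like the sketch below. `OOD_USER` is a placeholder I made up for the affected account, not an OnDemand setting, and reading the log usually needs root.

```shell
# OOD_USER is a placeholder for the affected account (not an OnDemand setting).
OOD_USER="${OOD_USER:-$(id -un)}"
log="/var/log/ondemand-nginx/${OOD_USER}/error.log"

if [ -r "$log" ]; then
  # Show the most recent entries from this user's PUN log.
  tail -n 50 "$log"
else
  echo "cannot read $log (run as root, or no log exists for this user yet)"
fi
```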
You can set the timeout through this Apache configuration. Just drop a timeout.conf in your conf.d directory and it’ll be a global setting.
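As a rough sketch of what that file could look like: `Timeout` and `ProxyTimeout` are standard Apache directives, but the 120 second value and the choice to set both are assumptions to adapt to your site. `CONF_FILE` is just a placeholder so the file can be staged locally before it is copied into conf.d.

```shell
# CONF_FILE is a placeholder; the real target would be something like
# /etc/httpd/conf.d/timeout.conf (the path varies by distro).
CONF_FILE="${CONF_FILE:-./timeout.conf}"

cat > "$CONF_FILE" <<'EOF'
# Seconds Apache waits on I/O before failing a request (default 60).
Timeout 120
# Seconds to wait on proxied backends; inherits Timeout when unset.
ProxyTimeout 120
EOF

# Once the file is in conf.d, check the syntax and restart Apache, e.g.:
#   apachectl configtest && systemctl restart httpd
# (the service name is an assumption and differs between distros)
```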
Unfortunately there are no other messages in /var/log/messages or journalctl, nor in the top-level error.log. We took another look at the Apache logs and nothing stands out except that original timeout error. We may try to get this up and running again with the restored version of the VM.
Strangely, this seems to have something to do with OnDemand 3.0, though we aren’t sure what. If we take our restored VM running OOD 2, we are able to log in without trouble, but if we upgrade that same VM to 3.0, the exact same issue appears. We’re continuing to look into it.
We are seeing a similar issue running 3.0.1 on RHEL 8.5. So far I’ve only heard reports of this happening to one user. All I see in their /var/log/ondemand-nginx/$USER/error.log is the following:
[ E 2023-06-02 08:59:32.8974 882831/T2d age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/ood/apps/sys/dashboard: A timeout occurred while spawning an application process.
Error ID: 4061c76a
Error details saved to: /tmp/passenger-error-vuoUnL.html
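To check whether this is limited to one account, something like the sketch below lists which per-user logs contain that Passenger message. The function name is made up for illustration, the search string is copied from the error above, and the directory argument is assumed to be the per-user log root.

```shell
# Search string copied from the Passenger error above.
pattern="A timeout occurred while spawning an application process"

# spawn_timeout_logs is a made-up helper name; $1 is the per-user log root.
spawn_timeout_logs() {
  # -l prints only the names of matching files, one per affected user.
  # Errors for unreadable or missing logs are hidden.
  grep -l -- "$pattern" "$1"/*/error.log 2>/dev/null || true
}

matches=$(spawn_timeout_logs /var/log/ondemand-nginx)
if [ -n "$matches" ]; then
  echo "$matches"
else
  echo "no spawn timeouts found (or logs not readable)"
fi
```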
Per one of the earlier suggestions, we bumped the timeout in /etc/httpd/conf.d/timeout.conf to 120 seconds and it’s still timing out.
The only other thing I thought of was that this user had development enabled, and I’ve sometimes seen a “bad app” in dev cause issues. But I had him move his dev apps out of his home directory and that didn’t fix it, so I don’t think it’s anything he’s done.
Other than that, I can say we are using CAS authentication. I’m working with our enterprise folks to make sure there are no errors on the CAS side, but I am able to successfully auth with my account, so my hunch is CAS is not the issue.
Happy to provide details from other log files if requested.
I guess our issue did have to do with having development enabled. The issue was isolated just to this one user, and when I removed his /var/www/ood/apps/dev/$USER/gateway link, he was able to get into the dashboard again. Then we added that link back in and everything is still working.
Not sure what to make of that, but hopefully it’s an additional data point at least.
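For anyone wanting to repeat that test, here is a sketch of moving the gateway link aside and putting it back. The function names and the `.disabled` suffix are made up for this example; the only thing taken from our setup is the /var/www/ood/apps/dev/$USER/gateway location.

```shell
# Both helpers take the user's dev directory, e.g. /var/www/ood/apps/dev/<user>.
# Function names and the ".disabled" suffix are hypothetical.

disable_gateway() {
  link="$1/gateway"
  if [ -L "$link" ]; then
    # Record where the link pointed before moving it out of the way.
    echo "gateway -> $(readlink "$link")"
    mv "$link" "$link.disabled"
  else
    echo "no gateway symlink at $link"
  fi
}

restore_gateway() {
  # Only restore when a disabled link exists and nothing has replaced it.
  if [ -L "$1/gateway.disabled" ] && [ ! -e "$1/gateway" ]; then
    mv "$1/gateway.disabled" "$1/gateway"
  fi
}
```

Usage would be something like `disable_gateway /var/www/ood/apps/dev/$USER`, have the user log in again, then `restore_gateway` with the same path.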
We never did get more information on this issue, but I am happy to report that we started over and our fresh install of version 2.X is working fine. We are using ADFS authentication, and the enterprise folks did not see anything out of the ordinary either. To add to the recent replies, we were not using SELinux. On our side, we did have dev enabled, but only for one user, and our issue was that all users suddenly reported getting the proxy error one afternoon without any changes to the server. We think it’s especially strange that we had this issue with OOD 3.0 on a fresh install, so it must be something on our end; our collective IT team just has no idea what. To be honest, we haven’t looked into it further, as it’s been working and other projects have taken priority. We may experiment with another 3.0 instance in the future.