Internal Server Error on first login & interactive apps rsync input/output error issue

All,

Apologies for the recent posts as we work through issues and solve them. We are working on OnDemand integration with Duo/ADFS on a new VM and could use some guidance.

Everything seems to be working well except for one strange issue: the first time anyone logs in, they are presented with a page reading “Internal Server Error: The server encountered an internal error or misconfiguration and was unable to complete your request…”. As soon as the person refreshes the page, they are logged in without further issue. I do not see anything out of the ordinary in the logs. Is there something we can look at that might point to what the issue is?

Another user seems to be having an issue when launching interactive apps. They get:

rsync: mkstemp “/home/user/ondemand/data/sys/dashboard/batch_connect/sys/jupyterlab/output/6a9ca682-cf2b-4ab8-89ae-86681ec79cf6/.after.sh.HBJD5t” failed: Input/output error (5)
rsync: mkstemp “/home/user/ondemand/data/sys/dashboard/batch_connect/sys/jupyterlab/output/6a9ca682-cf2b-4ab8-89ae-86681ec79cf6/.before.sh.erb.oTykOr” failed: Input/output error (5)
rsync: mkstemp “/home/user/ondemand/data/sys/dashboard/batch_connect/sys/jupyterlab/output/6a9ca682-cf2b-4ab8-89ae-86681ec79cf6/.script.sh.erb.0PXcxp” failed: Input/output error (5)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2]

Some other users have the same issue, while others, like me, can launch the interactive apps just fine.

Any ideas on how to solve these two issues would be greatly appreciated.

No issues with the posts - we budget both in time and money for answering questions. So that’s no issue at all!

There should be something in Apache’s error logs under /var/log/httpd24-httpd for this. Whether it’s meaningful or not I can’t say, but that’s our best bet for seeing any output.
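With debug logging turned on, the one line that matters can get buried, so filtering the error log by level may help. The following is a minimal sketch, assuming the log uses Apache 2.4's default error log format, where each line carries a `[module:level]` field. The names `LEVEL_FIELD`, `SERIOUS` and `serious_lines` are made up for this example.

```python
import re

# Matches the "[module:level]" field of Apache 2.4's default error log
# format, e.g. "[core:error]" or "[auth_mellon:debug]".
LEVEL_FIELD = re.compile(r"\[(?P<module>[\w.-]*):(?P<level>[a-z]+\d?)\]")

# Log levels at "warn" severity or worse.
SERIOUS = {"emerg", "alert", "crit", "error", "warn"}


def serious_lines(lines):
    """Yield only the log lines whose level is warn or more severe."""
    for line in lines:
        match = LEVEL_FIELD.search(line)
        if match and match.group("level") in SERIOUS:
            yield line.rstrip("\n")
```

To use it, open whichever error log file sits under /var/log/httpd24-httpd (the file name depends on your virtual host configuration) and pass the file object to `serious_lines`, then compare the timestamps of what it prints with the time of a first login.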

My guess is, there’s something funny with one of your HOME drives. At OSC our HOME drives are partitioned to maybe 4-5 storage nodes. One can blow up with a lot of lag and affect a certain number of folks but not others.

That is, my thinking is that one of your storage devices is not working well, so it affects only the subset of your users who happen to use that storage device.

There may be something in dmesg or journalctl about this, but I’m guessing you’ll have to run more tests on that file location with your storage folks.
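One such test is a small script, run as an affected user on the host where the rsync error appears, that repeats what rsync is failing to do: create a temporary file in the target directory. This is only a sketch, assuming Python 3 is available on that host. The function name `probe_directory` and the `.probe.` file prefix are invented for this example.

```python
import errno
import os
import tempfile


def probe_directory(path, attempts=5):
    """Create, write, fsync, and delete temp files in `path`.

    Returns a list of (attempt_number, error_description) tuples for the
    attempts that failed; an empty list means every attempt succeeded.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            # Similar to rsync's mkstemp step: create a uniquely named
            # file inside the destination directory.
            fd, tmp_name = tempfile.mkstemp(prefix=".probe.", dir=path)
            try:
                os.write(fd, b"storage probe\n")
                os.fsync(fd)  # push the write through to the storage backend
            finally:
                os.close(fd)
                os.unlink(tmp_name)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, "UNKNOWN")
            failures.append((attempt, f"{name} ({exc.errno}): {exc.strerror}"))
    return failures
```

Calling `probe_directory` on the affected user's output directory and then on an unaffected user's home should show whether the failure follows the storage location. A failure reported as `EIO (5)` would match the "Input/output error (5)" in the rsync output.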

Thanks Jeff for the reply. The good news is that we’ve resolved the issue with the input/output error. However, we are still getting the Internal Server Error on the first login for people.

We’ve looked through the logs and really cannot see anything out of the ordinary that would cause this problem. We see the user map successfully, and that’s it, nothing more in the logs to note. It’s on debug as well so we should be getting all of the output. Do you know of anyone else who has seen this issue? A refresh of the page always fixes it, but it throws people off when they first see the error and they think the site is down.

That’s good to hear about the disk drives.

As to your auth issues, I’m not sure. Seems like something in the ADFS handshake flow is going wrong? Maybe you can try MellonDiagnosticsEnable and/or get some info from the browser.

This is an example of what I’d be looking for. It’s from Chrome, because Chrome can capture requests across redirects (though you may have to enable it?).

In any case, you can see I get a 302, another 302, and then I hit the dashboard with 200 response codes. I’m wondering what your flow looks like and where your 500 is coming from (which page/network call).
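If reading the Network tab by eye is awkward, the capture can be saved as a HAR file (in Chrome, with "Preserve log" ticked so entries survive the redirects) and summarized with a short script. The sketch below relies only on the standard HAR layout (`log.entries[].request` and `log.entries[].response`). The function names are invented for this example.

```python
import json


def load_har(path):
    """Load a HAR capture from disk (utf-8-sig tolerates a leading BOM)."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def summarize_har(har):
    """Return (status, method, url) for each request, in capture order."""
    return [
        (
            entry["response"]["status"],
            entry["request"]["method"],
            entry["request"]["url"],
        )
        for entry in har["log"]["entries"]
    ]


def server_errors(har):
    """Return only the requests that came back with a 5xx status."""
    return [row for row in summarize_har(har) if 500 <= row[0] <= 599]
```

Printing `summarize_har(load_har(...))` for a first login shows the sequence of status codes, and `server_errors` picks out the request that returned the 500. HAR files can contain cookies and session tokens, so check the file before sharing it.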