Launch Scripts: Trap Errors

In the workflow for the launcher scripts, what is the best point to trap errors and generate an error message for the user? Up until the point where the script gets a scheduler allocation and launches, we have all seen the red box at the top.

For all our various launchers (Jupyter, MATLAB, etc.), I would like to codify and test for all the typical things that can go wrong on an HPC cluster. These are usually monitored periodically by something like Nagios or Icinga, but interactive users (ssh or OOD) seem to notice problems sooner. Testing for and trapping errors will give a better experience.

For example, a software repo being offline, a license being unavailable or fully checked out, a full home directory, a parallel file system over quota, an unavailable path, etc. could all be tested for, with an appropriate message output. Anything that would prevent the app from launching, or from working after launch, is a candidate.
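A minimal Ruby sketch of two such checks (an unavailable path and a directory that cannot be written to, for example because it is full or over quota). The helper name `preflight_problems` and its parameters are hypothetical, and a one-byte write is only a rough probe; license and repo checks would be site-specific and are not shown.

```ruby
require "tempfile"

# Hypothetical helper (not part of OOD): returns a list of human-readable
# problems, or an empty list when every check passes.
def preflight_problems(required_paths:, writable_dirs:)
  problems = []

  # Paths the app needs, e.g. a software install or module tree.
  required_paths.each do |path|
    problems << "Required path is unavailable: #{path}" unless File.exist?(path)
  end

  # Directories the app must write to. A tiny write fails with a
  # SystemCallError (ENOSPC, EDQUOT, EACCES, ENOENT, ...) when it cannot succeed.
  writable_dirs.each do |dir|
    begin
      Tempfile.create("preflight", dir) do |f|
        f.write("x")
        f.flush
      end
    rescue SystemCallError => e
      problems << "Cannot write to #{dir}: #{e.message}"
    end
  end

  problems
end
```

A launcher could call this with its own paths and raise when the returned list is not empty, for example `raise StandardError, problems.join("; ") unless problems.empty?`.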

Ideas?

I would need to fix https://github.com/OSC/ood-dashboard/issues/451, but then you could add a block of Ruby code to an interactive app’s submit.yml.erb that did the tests (or executed an external script that did the tests) and raised an exception with an explanation.
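For the external-script variant, something along these lines might work. `run_check!` is a hypothetical helper, and the convention that the script exits non-zero and prints its explanation is an assumption of this sketch, not something OOD defines.

```ruby
require "open3"

# Hypothetical helper: runs an external check command and raises with the
# command's combined stdout/stderr if it exits with a non-zero status.
def run_check!(*command)
  output, status = Open3.capture2e(*command)
  return if status.success?

  raise StandardError, "Pre-launch check failed: #{output.strip}"
rescue SystemCallError => e
  # The command itself could not be started (missing, not executable, ...).
  raise StandardError, "Could not run pre-launch check #{command.first}: #{e.message}"
end
```

Inside the ERB block this would be a single call such as `run_check!("/path/to/site_checks.sh")`, where the script path is a placeholder for whatever the site maintains.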

Right now, if you raise an exception in submit.yml.erb, it will unfortunately be unhandled. If you raise an exception in an erb file in the template directory of an interactive app, it will be displayed to the user, but files that are never used will still be copied to the user’s directory.

So until this is fixed you could place an erb file in the template directory and raise an exception there. That would not be ideal for the case where the home directory is full, since files are still copied to the user’s directory first. But for that case there is always https://osc.github.io/ood-documentation/master/customization.html#disk-quota-warnings-on-dashboard.

I can add this issue to a list of 1.5 bug fixes so when we fix these we can just do a 1.5 patch release so it is easier to get the fix.

Another option is a monkey patch that overrides the default BatchConnect::Session#save method https://github.com/OSC/ood-dashboard/blob/baea5558a4e55e8f7ca49599670e6731ad2da94b/app/models/batch_connect/session.rb#L137. Then you could place that in a custom dashboard initializer in /etc/ood/config and that would just affect all of the interactive apps. Monkey patches are of course brittle solutions though.
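A rough sketch of the monkey-patch approach, using `Module#prepend` so the original method is still reached through `super`. `DemoSession`, `PreflightGuard`, and `CHECKS` are hypothetical names. `DemoSession` is only a stand-in so the example runs by itself, and its `save(app:, context:)` signature is made up rather than copied from the dashboard. Whether an exception raised at this point is shown to the user or ends up unhandled is not something this sketch establishes.

```ruby
# Stand-in for illustration only. In the dashboard the target would be
# BatchConnect::Session; its real #save signature is not assumed here.
class DemoSession
  def save(app:, context:)
    "saved #{app}"
  end
end

# Hypothetical guard: runs site checks, then delegates to the original #save.
module PreflightGuard
  # Each check is a callable returning nil when healthy or a message string.
  CHECKS = []

  def save(*args, **kwargs, &block)
    problems = CHECKS.map(&:call).compact
    raise StandardError, problems.join("; ") unless problems.empty?

    super # bare super forwards the original arguments unchanged
  end
end

# Prepending puts the guard ahead of the class's own #save in method lookup.
DemoSession.prepend(PreflightGuard)
```

Checks would then be registered once, for example `PreflightGuard::CHECKS << -> { "Scratch is unavailable" unless File.exist?("/scratch") }`, with the path being a placeholder.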

Worth stating that the solutions I’ve mentioned thus far just make use of what is available right now, or with minor fixes, to enable validations on submitting the form. There may be a more appropriate way that could also leverage Rails’ built-in model validations, but that would take more thought.

Thanks for all the ideas on how to deal with this. I’ll try something out in my free time, or if it is fixed beforehand as a bug fix (or even a new feature), all the better.

This seems to work as expected now. I was able to make this simple test where I just raise an error unless a temp file exists, so you could likely make all sorts of tests here.

<%-
  raise StandardError, "This is just a test!" unless File.exist?("/tmp/error_file")
-%>
---