The apps in OnDemand (v4.0.7, SLURM) fail to start intermittently with “Module not found” error, while module load on a compute node works fine. Can it be something related to OnDemand or its configuration?
I doubt this is OOD. This sounds more like something with lmod on the compute node and resource constraints as it tries to load the modules. Could be due to file mounting and latency, could be an issue with the script itself trying to load the modules before the environment is ready.
Is the module random or does it seem to be the same one or set of modules when this happens?
Are you using NSF or any type of shared file system and are there any issues with how it is performing?
Are you setting set -x in the script to capture all the information you can in the app’s session logs? You can get more targeted but that will capture everything to start, but adding things like module avail may help to see if what you are trying to load is even there to start.