Slurm job fails in OOD but works at CLI

OnDemand version: v1.6.22 | Dashboard version: v1.35.3

My sbatch job works fine from the CLI:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1 # Number of MPI processes per node
#SBATCH --output=out.%j.txt
#SBATCH --error=err.%j.txt
#SBATCH --gres=gpu:1

NCCL_SOCKET_IFNAME=eth0 \
    mpirun \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x NCCL_SOCKET_IFNAME \
    --mca btl tcp,self \
    --mca pml ob1 \
    --mca btl_tcp_if_include "eth0" \
    --mca mpi_show_mca_params all \
    singularity exec /mnt/shared/images/singularity/nc-tensorflow.sif python /home/updikca1/Horovod.MNIST.orig.py

But it fails when launched from the OOD UI (sys/myjobs workflow). The error is:

--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

How can I go about debugging this?

strace maybe? When you say you can run it from the CLI, is that on the same host? Is the CLI session on the same server as OOD, or on some other login host? That could be your discrepancy; there could actually be some networking issue there.
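
If the two submissions do come from different hosts, a quick comparison of what each host sees on the network might explain the ORTE failure. This is only a sketch; eth0 and the compute-node address are placeholders for whatever your site actually uses:

# Run on both the OOD web host and the login node you normally submit from
hostname
ip -o addr show eth0              # does eth0 exist here and carry the expected address?
ip route get <compute-node-ip>    # placeholder: which interface routes to the compute nodes?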

It could also be the environment. OOD defaults to SBATCH_EXPORT=NONE so you can load a brand new environment. I would say add an env statement to the job script to see if there's something you're missing (an LD_LIBRARY_PATH entry, maybe?).
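
For example (just a sketch; the output file name is arbitrary), you could add something like this near the top of the script, submit once from the CLI and once through OOD, and diff the two files:

# Dump the environment the job actually sees so the CLI and OOD runs can be compared
env | sort > "$HOME/env.${SLURM_JOB_ID}.txt"

A diff of the two dumps should show any module paths or LD_LIBRARY_PATH entries that only the CLI submission carries.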

@cupdike were you able to figure out and/or resolve this error?

I’ll bet it’s due to OOD defaulting to SBATCH_EXPORT=NONE; I believe that has issues with parallelism. I wonder if adding an #SBATCH --export=ALL directive to the script overrides this?
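
If you want to test that, the change would just be adding the directive to the existing header (untested sketch; an SBATCH_EXPORT variable set in the submitting environment can still take precedence over in-script directives, so it's worth verifying with env inside the job):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1
#SBATCH --export=ALL          # propagate the submitting shell's environment to the job
#SBATCH --output=out.%j.txt
#SBATCH --error=err.%j.txt
#SBATCH --gres=gpu:1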