OnDemand version: v1.6.22 | Dashboard version: v1.35.3
My sbatch job works fine when submitted from the CLI:
#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=1 # Number of MPI processes per node
#SBATCH --output=out.%j.txt
#SBATCH --error=err.%j.txt
#SBATCH --gres=gpu:1
NCCL_SOCKET_IFNAME=eth0 \
mpirun \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-x NCCL_SOCKET_IFNAME \
--mca btl tcp,self \
--mca pml ob1 \
--mca btl_tcp_if_include "eth0" \
--mca mpi_show_mca_params all \
singularity exec /mnt/shared/images/singularity/nc-tensorflow.sif python /home/updikca1/Horovod.MNIST.orig.py
But it fails when launched from the UI (sys/myjobs/workflows). The error is:
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
How can I go about debugging this?
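
One thing I could try first, as a rough sketch: sbatch propagates the submitting environment by default, so the job may see a different PATH, LD_LIBRARY_PATH, or module setup when it is submitted from the web UI instead of my login shell. The helpers below record the job's environment so that a CLI run and a UI run can be compared. The names dump_env and diff_env and the env.*.txt file names are hypothetical, not part of Slurm or OnDemand, and I am only assuming that the environment is where the two runs differ.

```shell
#!/bin/bash
# Hypothetical helper: write the current environment, sorted, to the file in $1.
# Note: exported multi-line values (e.g. shell functions) will be split by sort.
dump_env() {
    env | sort > "$1"
}

# Hypothetical helper: print the lines that differ between two environment dumps.
# diff exits non-zero when the files differ, so "|| true" keeps the caller going.
diff_env() {
    diff "$1" "$2" || true
}
```

The idea would be to call dump_env "env.$SLURM_JOB_ID.txt" in the batch script just before mpirun, submit the job once from the CLI and once from the UI, and then run diff_env on the two resulting files to see which variables changed.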