We have installed Open OnDemand (OOD) v3.1. While troubleshooting the following problems, I have been looking at the copy_environment field of the job options.
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
When I submit the job, I also get the following errors. It seems that the job's environment has been cleared (the module shell function and the MPI binaries are not found). How can I solve this problem?
/var/spool/slurmd/job30416/slurm_script: line 15: module: command not found
/var/spool/slurmd/job30416/slurm_script: line 16: module: command not found
/var/spool/slurmd/job30416/slurm_script: line 27: mpiexec.hydra: command not found
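As a rough local analogy (this is my own sketch, not SLURM itself): a shell that only inherits its caller's environment never reads /etc/profile, so shell functions such as module, defined by the scripts in /etc/profile.d/, are simply absent — which matches the "module: command not found" errors above:

```shell
# A shell started with an inherited-only environment (env -i wipes it
# entirely, an extreme version of "no profile was read") has no 'module'
# shell function, because functions come from sourcing init scripts,
# not from the exported environment.
no_profile=$(env -i bash -c 'type module >/dev/null 2>&1 && echo present || echo absent')
echo "non-login shell: $no_profile"   # -> absent

# A login shell (-l) sources /etc/profile and /etc/profile.d/*.sh, which
# is where Lmod defines 'module' on systems that have it installed.
# (tail -n1 keeps only our echo, in case profile scripts print anything.)
login=$(bash -lc 'type module >/dev/null 2>&1 && echo present || echo absent' 2>/dev/null | tail -n1)
echo "login shell: $login"            # present if Lmod is installed, else absent
```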
In the SLURM adapter (ood_core-0.25.0/lib/ood_core/job/adapters/slurm.rb), there is a function export_arg that takes env and copy_environment as arguments. Here is the relevant code:
# We default to --export=NONE, but SLURM defaults to ALL.
# We do this because with NONE, SLURM sets up a new environment,
# loading /etc/profile and thereby giving the 'module' function
# (among other things shells give), where the PUN did not.
# --export=ALL exports the PUN's environment.
def export_arg(env, copy_environment)
  if !env.empty? && !copy_environment
    env.keys.join(",")
  elsif !env.empty? && copy_environment
    "ALL," + env.keys.join(",")
  elsif env.empty? && copy_environment
    # only this option changes behavior dramatically
    "ALL"
  else
    "NONE"
  end
end
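To make the four cases concrete, here is the same decision table mirrored as a hypothetical shell helper (this export_arg is my own sketch, not part of ood_core), showing which --export value sbatch ends up receiving:

```shell
# Hypothetical shell mirror of export_arg from ood_core's slurm.rb:
# $1 = comma-separated env var names (may be empty),
# $2 = "true" if copy_environment is set.
export_arg() {
  local keys="$1" copy="$2"
  if [ -n "$keys" ] && [ "$copy" != "true" ]; then
    echo "$keys"          # sbatch --export=FOO,BAR (only the listed vars)
  elif [ -n "$keys" ] && [ "$copy" = "true" ]; then
    echo "ALL,$keys"      # sbatch --export=ALL,FOO,BAR
  elif [ -z "$keys" ] && [ "$copy" = "true" ]; then
    echo "ALL"            # sbatch --export=ALL (copies the PUN environment)
  else
    echo "NONE"           # sbatch --export=NONE (SLURM builds a fresh env)
  fi
}

export_arg "" false          # -> NONE
export_arg "" true           # -> ALL
export_arg "FOO,BAR" false   # -> FOO,BAR
export_arg "FOO,BAR" true    # -> ALL,FOO,BAR
```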
The comments in the code mention that when copy_environment is set to true, the PUN's (Per-User NGINX) environment is exported. The comments also mention that the PUN did not load /etc/profile.
I have Lmod variables set in /etc/profile.d/. Based on the comments and my understanding, if copy_environment is enabled, these Lmod variables will not be available in my SLURM job environment: the job inherits the PUN's environment, and the PUN never sourced /etc/profile, so the scripts in /etc/profile.d/ never ran.
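If that reading is right, one defensive workaround is to restore the login environment only when it is actually missing. A sketch (assuming, as on my system, that Lmod is initialized from a script in /etc/profile.d/):

```shell
# Sketch: re-create the login environment only when it is missing,
# e.g. when the job was submitted with --export=ALL from the PUN.
if ! type module >/dev/null 2>&1; then
  # 'module' is a shell function defined by Lmod's init script,
  # which /etc/profile normally pulls in via /etc/profile.d/*.sh
  [ -r /etc/profile ] && . /etc/profile
fi

if type module >/dev/null 2>&1; then
  status="module function available"
else
  status="module function still missing"
fi
echo "$status"
```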
In my experiments, I found that adding source /etc/profile to my script makes the Lmod variables work as expected in the SLURM job environment. Here is what I added to my script:
#!/bin/bash
#SBATCH -J openmpi-hello-world # Job name
#SBATCH --partition development
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=112
#SBATCH -o %j_openmpi-4.1.6-gcc-10.4.0.log # Standard output file, relative to the working directory
#SBATCH -e %j_openmpi-4.1.6-gcc-10.4.0.err # Standard error file, relative to the working directory
source /etc/profile   # runs /etc/profile.d/*.sh, which defines the module function
env                   # dump the job environment for debugging
export UCX_TLS=ud,dc,rc,self
export OMPI_MCA_btl=tcp,self
export OMPI_MCA_pml=ucx
export UCX_NET_DEVICES=mlx5_0:1
GCC_VERSION="10.4.0"
OPENMPI_VERSION="4.1.6"
module purge
module load gcc/"$GCC_VERSION" openmpi/"$OPENMPI_VERSION"
mpirun hello-world_openmpi-"$OPENMPI_VERSION"-gcc-"$GCC_VERSION"
I would appreciate it if someone could confirm my understanding or correct me if I'm wrong. Any additional insights or suggestions are also welcome.