Job Composer: "module: command not found" when the Copy environment option is checked

Hi

We installed OOD v3.1. To work around the following error, I checked the Copy environment (copy_environment) option in the job options.

An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).

With that option checked, submitting the job produces the following error. It seems that the environment variables have been cleared. How can I solve this problem?

/var/spool/slurmd/job30416/slurm_script: line 15: module: command not found
/var/spool/slurmd/job30416/slurm_script: line 16: module: command not found
/var/spool/slurmd/job30416/slurm_script: line 27: mpiexec.hydra: command not found

Thank you

Hello and welcome!

Could you post the content of the submit.yml you are using? I’m assuming you’ve set copy_environment: true. Is that correct?

Hi travert

Sorry, where can I find the Job Composer’s submit.yml? Is it under /var/www/ood/apps/sys/myjobs/?

I created a job in the Job Composer, and then checked Copy environment in the job options.
[Screenshot: job options with Copy environment checked]

Next, I submitted the job and got the error: command not found

I did not set copy_environment: true in /etc/ood/config/clusters.d/xxx.yml, so I don’t know which setting is wrong.

Thank you!

That looks like an issue with either the compute node the job was allocated to or your user’s .bashrc, not with OOD.

The Job Composer is just a GUI for writing standard sbatch jobs.

Hi

In the SLURM adapter (ood_core-0.25.0/lib/ood_core/job/adapters/slurm.rb), there is a function export_arg that takes env and copy_environment as arguments. Here is the relevant code:

          # we default to export NONE, but SLURM defaults to ALL.
          # we do this bc SLURM setups a new environment, loading /etc/profile
          # and all giving 'module' function (among other things shells give),
          # where the PUN did not.
          # --export=ALL export the PUN's environment.
          def export_arg(env, copy_environment)
            if !env.empty? && !copy_environment
              env.keys.join(",")
            elsif !env.empty? && copy_environment
              "ALL," + env.keys.join(",")
            elsif env.empty? && copy_environment
              # only this option changes behavior dramatically
              "ALL"
            else
              "NONE"
            end
          end

The comments in the code mention that when copy_environment is set to true, the PUN’s (Per-User NGINX) environment is exported. They also note that the PUN does not load /etc/profile.

I have an LMOD variable set in a script under /etc/profile.d/. Based on the comments and my understanding, if copy_environment is enabled, the job inherits the PUN’s environment instead of a fresh login environment, so this LMOD variable (and the module function) will not be available in my SLURM job because the PUN does not load /etc/profile.

In my experiments, I found that by adding source /etc/profile to my script, the LMOD variable works as expected in the SLURM job environment. Here is what I added to my script:

#!/bin/bash
#SBATCH -J openmpi-hello-world                              # Job name
#SBATCH --partition development
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=112
#SBATCH -o %j_openmpi-4.1.6-gcc-10.4.0.log                  # Standard output file, relative to the working directory
#SBATCH -e %j_openmpi-4.1.6-gcc-10.4.0.err                  # Standard error file, relative to the working directory

# Load the login environment so that the module command is defined
source /etc/profile
env    # Print the resulting environment for debugging
export UCX_TLS=ud,dc,rc,self
export OMPI_MCA_btl=tcp,self
export OMPI_MCA_pml=ucx
export UCX_NET_DEVICES=mlx5_0:1

GCC_VERSION="10.4.0"
OPENMPI_VERSION="4.1.6"

module purge
module load gcc/"$GCC_VERSION" openmpi/"$OPENMPI_VERSION"

mpirun hello-world_openmpi-"$OPENMPI_VERSION"-gcc-"$GCC_VERSION"
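
A more defensive variant of the same workaround would source the profile only when module is actually missing, so the script behaves the same whether or not Copy environment is checked. This is a minimal sketch, assuming a bash job script; ensure_module and PROFILE_FILE are hypothetical names introduced here for illustration, not part of Slurm, Lmod, or OOD:

```shell
#!/bin/bash
# Sketch only: ensure_module and PROFILE_FILE are hypothetical names,
# not part of Slurm, Lmod, or Open OnDemand.

# Source the login profile only if the `module` command is missing.
# PROFILE_FILE defaults to /etc/profile, the file that worked in this thread.
ensure_module() {
    if ! command -v module >/dev/null 2>&1; then
        # shellcheck disable=SC1090
        . "${PROFILE_FILE:-/etc/profile}"
    fi
    # Succeed only if `module` is defined after the attempt.
    command -v module >/dev/null 2>&1
}
```

In a job script this would probably be called as ensure_module || exit 1 just before module purge, in place of the unconditional source /etc/profile, so that the job stops early with a clear failure if the profile does not define module.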

I would appreciate it if someone could confirm my understanding or correct me if I’m wrong. Any additional insights or suggestions are also welcome.

Thank you in advance for your help.

Best regards,
Frank
