While testing OOD 1.7, I noticed that the interactive apps seem to be starting with a limited environment, which is different from the behavior in 1.6.
In our case, we source Lmod through a series of profile.d-like scripts that are called from /etc/profile.d, shown below. In 1.6 the module.sh would get sourced, while in 1.7 it does not. We do have a condition below that these files are not sourced for UID < 500, so is there any chance that the interactive app sessions start as a non-user? I don't see why or how that could happen, but I just want to make sure.
Or, any other thoughts?
Below is the profile.d file structure:
$ cat /etc/profile.d/chpc.sh
if [[ $UID -ge 500 ]]
then
    if [ -f /uufs/chpc.utah.edu/sys/etc/chpc.sh ]
    then
        source /uufs/chpc.utah.edu/sys/etc/chpc.sh
    fi
fi
$ cat /uufs/chpc.utah.edu/sys/etc/chpc.sh
for i in /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh; do
    if [ -r "$i" ]; then
        if [ "$PS1" ]; then
            . "$i"
        else
            . "$i" >/dev/null 2>&1
        fi
    fi
done
$ ls /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh
/uufs/chpc.utah.edu/sys/etc/profile.d/module.sh
(module.sh is a somewhat longer file that sets up the appropriate Lmod).
I should add that we do have a workaround for this: adding the following to template/script.sh.erb:
if [ -z "$LMOD_VERSION" ]; then
    source /etc/profile.d/chpc.sh
fi
You should use script_wrapper or header in your cluster config instead of editing the template scripts (script_wrapper needs the %s because it wraps the script, so you can put things above or below it; see the sketch after the config below).
batch_connect:
  vnc:
    header: "#!/bin/bash"
    script_wrapper: |
      if [ -z "$LMOD_VERSION" ]; then
        source /etc/profile.d/chpc.sh
      fi
      %s
  basic:
    # same result as above
    header: |
      #!/bin/bash
      if [ -z "$LMOD_VERSION" ]; then
        source /etc/profile.d/chpc.sh
      fi
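For context on how %s works, here is a rough sketch of what the rendered batch script body ends up looking like with the vnc script_wrapper above (illustrative only, not literal OOD output):
#!/bin/bash
if [ -z "$LMOD_VERSION" ]; then
  source /etc/profile.d/chpc.sh
fi
# ... the app's generated script content is substituted in place of %s here ...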
As to the difference between 1.6 and 1.7, I can't say off the top of my head why they'd be different.
Actually, @mcuma, I can think of what's different now, especially as it relates to environment variables. We added a copy_environment option for all the schedulers.
In, say, SLURM it's --export. Do you set and/or use the job_environment map? (I found some Utah docs that seem to indicate you use SLURM.)
The behaviour now is: if you don't use job_environment, it won't use the --export flag, which is what it did in 1.6.
If you do use job_environment, then it'll do --export=NONE,FOO,BAR if you don't set copy_environment, and --export=ALL,FOO,BAR if you do set copy_environment to true (I know that sentence has a lot of ifs in it and may be convoluted).
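To make that concrete, here is a minimal sketch of a batch-connect app's submit.yml.erb (FOO and BAR are placeholder variables, and the comments just restate the behavior described above):
script:
  job_environment:
    FOO: "foo"
    BAR: "bar"
  # with copy_environment set to true, the submit would use --export=ALL,FOO,BAR
  copy_environment: true
  # omit copy_environment (or leave it false) and the submit would use --export=NONE,FOO,BAR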
This is the only thing I can think of. The sessions cannot start as a non-user; OOD submits the job through Slurm as the UID of the given user (unless you have some wrapper script in front of srun).
I think you may be on the right track, since the environment seems to be passed, but the Lmod "module" command is an alias, which probably does not pass. Though the job should be opening a new terminal on the compute node and sourcing Lmod, which is what a standard SLURM job does, and there the alias is functional. So it still looks like the /etc/profile.d/… part is not being sourced.
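One quick way to check what the session actually gets (a debugging sketch to run inside the interactive app's shell, or to drop temporarily into the batch script):
type module 2>/dev/null || echo "module is not defined in this shell"
echo "LMOD_VERSION=${LMOD_VERSION:-unset}"
shopt -q login_shell && echo "login shell" || echo "non-login shell"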
I don't think we do anything with the job environment, and the default should be --export=ALL.
Is there any documentation on the job_environment?
Well, there's clearly something awry here. Looking over the code again, the default SLURM behavior for job environments should be the same. That is, the commands executed should be the same as before, with the same environment.
Looking at SLURM help tickets I see things like this, and the SLURM FAQ says something similar: the user environment is re-populated from a copy of the environment taken when the job was submitted through sbatch, with the SLURM_* environment variables added to it.
So somehow we are corrupting the environment when we run srun.
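A minimal way to see that difference outside of OOD (a sketch assuming you can submit to Slurm directly; MYVAR is just a placeholder variable):
export MYVAR=hello
# default: the submission environment is copied into the job, so MYVAR is set
sbatch --wrap 'echo "MYVAR=$MYVAR"'
# with --export=NONE the submission environment is not copied, so MYVAR comes up unset
sbatch --export=NONE --wrap 'echo "MYVAR=$MYVAR"'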
@jeff.ohrstrom Yes, we are using Slurm. Besides sourcing the module environment setup .sh script in the cluster config, I later found that I needed to add it to the Job Composer templates as well.
We used to set SBATCH_EXPORT=NONE, which makes it unclear to me how Slurm used to find this function definition. We now just use the --export argument, but that's only if there are other job_environment variables.
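For reference, my understanding (worth verifying against your Slurm version) is that the environment variable and the flag are equivalent:
# these two should behave the same with respect to what gets exported into the job
SBATCH_EXPORT=NONE sbatch job.sh
sbatch --export=NONE job.sh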
There's a lot of talk in the tickets that generated this change, but it's likely we'll have to patch this, as it looks like the previous behavior (however it happened) is expected and indeed a much better experience.
Glad I found this thread. We updated to 1.7.11 in production today and found that it broke all our Jupyter apps. I had forgotten to test this in dev. It doesn't know the "module" command, so nothing launches.
We will be fixing this in 1.7.13 so that the behavior is reverted. But the fixes in this PR should address the problem and be unaffected when 1.7.13 is released.
We have an additional issue at our center in that users and groups create their own modules and source them in their .bashrc files. The workaround provided doesn't help with this issue. Any idea when you'll be releasing version 1.7.13?
We have the patch in and we're building now, so tomorrow we should be able to verify everything is working as it should.
Then we'll be able to promote it to stable, so maybe tomorrow (5-28) or Friday (5-29), depending (and it's 1.7.14 now, to also pick up a Safari/noVNC patch).
This has been promoted to stable, so you don't need to pull it from our latest repo anymore; you can get it from the regular 1.7 repo.
Sites that have implemented this fix should not have issues updating directly. There should be no problem sourcing a file a second time (if it even gets sourced at all, given the if block there).