Interactive app environment in 1.7

While testing OOD 1.7, I noticed that the interactive apps seem to be starting with a limited environment, which is different from 1.6.

In our case, we source Lmod through a series of profile.d-like scripts that are called from /etc/profile.d, shown below. In 1.6 the module.sh would get sourced, while in 1.7 it does not. We do have a condition below that UID < 500 does not source these files, so, any chance that the interactive app sessions start as a non-user? I am not seeing why/how that could be, but I just want to make sure.

Or, any other thoughts?

Below is the profile.d file structure:
$ cat /etc/profile.d/chpc.sh
if [[ ${UID} -ge 500 ]]
then
  if [ -f /uufs/chpc.utah.edu/sys/etc/chpc.sh ]
  then
    source /uufs/chpc.utah.edu/sys/etc/chpc.sh
  fi
fi

$ cat /uufs/chpc.utah.edu/sys/etc/chpc.sh
for i in /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh; do
  if [ -r "$i" ]; then
    if [ "$PS1" ]; then
      . "$i"
    else
      . "$i" >/dev/null 2>&1
    fi
  fi
done

$ ls /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh
/uufs/chpc.utah.edu/sys/etc/profile.d/module.sh

(module.sh is a somewhat longer file that sets up the appropriate Lmod.)

Thanks,
MC

I should add that we do have a workaround for this, which is to add the following to template/script.sh.erb:
if [ -z "$LMOD_VERSION" ]; then
  source /etc/profile.d/chpc.sh
fi

You should use script_wrapper or header in your cluster config instead of editing the template scripts (script_wrapper needs the %s because it wraps the script, so you can put things above or below it).

      batch_connect:
        vnc:
          header: "#!/bin/bash"
          script_wrapper: |
            if [ -z "$LMOD_VERSION" ]; then
              source /etc/profile.d/chpc.sh
            fi
            %s
        basic:
          # same result as above
          header: |
            #!/bin/bash
            if [ -z "$LMOD_VERSION" ]; then
              source /etc/profile.d/chpc.sh
            fi

As to the difference between 1.6 and 1.7, I can't say off the top of my head why they'd be different.


Thanks Jeff, I did not think about the cluster configs.

Actually, @mcuma I can think of what's different now, especially as it relates to environment variables. We added something to copy_environment for all the schedulers.

In, say, SLURM it's --export. Do you set and/or use the job_environment map? (I found some Utah docs that seem to indicate you use SLURM.)

The behaviour now is, if you don't use job_environment, it won't use the --export flag - which is what it did in 1.6.

If you do use job_environment then it'll do --export=NONE,FOO,BAR if you don't set copy_environment, and --export=ALL,FOO,BAR if you do set copy_environment to true (I know that sentence has a lot of ifs in it and may be convoluted).
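To make that a bit more concrete, here is a minimal sketch of how those two settings might look in an app's submit.yml.erb (FOO and BAR are just placeholder variables, and your site's file layout may differ):

    ---
    script:
      job_environment:
        FOO: "foo"
        BAR: "bar"
      # copy_environment omitted or false -> sbatch gets --export=NONE,FOO,BAR
      # copy_environment: true            -> sbatch gets --export=ALL,FOO,BAR
      copy_environment: true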

This is the only thing I can think of. The sessions cannot start as a non-user; OOD submits the job through SLURM as the UID of the given user (unless you have some wrapper script in front of srun).

Hi Jeff,

I think you may be on the right track, since the environment seems to be passed, but the Lmod "module" command is an alias, which probably does not get passed. Though the job should be opening a new shell on the compute node and sourcing Lmod, which is what the standard SLURM job does, and there the alias is functional. So it's still looking like the /etc/profile.d/… part is not being sourced.

I don't think we do anything with the job environment, and the default should be --export=ALL

Is there any documentation on the job_environment?

Thanks,
MC

In updating from 1.6 to 1.7, I had to add sourcing the module environment setup .sh script to script_wrapper.

Can we add this to the docs somewhere, in the release notes maybe? This burned me as well.

Thanks,
Morgan

:frowning_face: Well, there's clearly something awry here. Looking over the code again, the default SLURM behavior for job environments should be the same. That is, the commands executed should be the same as before, with the same environment.

@milberg & @mjbludwig do you also use SLURM?

Looking at SLURM help tickets I see stuff like this (the SLURM FAQ also says something similar):

"The user environment is re-populated from a copy of the environment taken when the job was submitted through sbatch, with the SLURM_* environment variables added in to it."

So somehow we corrupted the environment when we run srun.

@jeff.ohrstrom Yes, we are using Slurm. Besides adding the sourcing of the module environment setup .sh script in the cluster config, I later found that I needed to add it to the Job Composer templates as well.
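For anyone hitting the same thing, what goes at the top of the Job Composer template scripts is essentially the same if block as in the cluster config earlier in the thread. A rough sketch only - the chpc.sh path is the one from earlier in this thread, and the time directive and module name are placeholders, so substitute whatever sets up Lmod at your site:

    #!/bin/bash
    #SBATCH --time=00:10:00
    # Make sure Lmod is set up before trying to use the module command
    # (the profile script path is site-specific)
    if [ -z "$LMOD_VERSION" ]; then
      source /etc/profile.d/chpc.sh
    fi
    module load some_app_module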

OK, we clearly broke previous behavior.

We used to set SBATCH_EXPORT=NONE, which makes it unclear to me how SLURM used to find this function definition. We now just use the --export argument, but that's only if there are job_environment variables.
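To make the difference concrete, this is roughly what the two behaviors look like from a shell (a sketch, not the exact commands OOD issues):

    # 1.6-era: OOD set this variable in the submission environment,
    # which is equivalent to passing --export=NONE to sbatch
    SBATCH_EXPORT=NONE sbatch job.sh

    # 1.7-era: no --export at all unless job_environment is set; with a
    # job_environment variable FOO and copy_environment true it becomes
    sbatch --export=ALL,FOO job.sh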

There's a lot of talk in the tickets that generated this change, but it's likely we'll have to patch this as it looks like the previous behavior (however it happened) is expected and indeed a much better experience.

Glad I found this thread. We updated to 1.7.11 in production today and found that it broke all our Jupyter apps. I had forgotten to test this in dev. It doesn't know the 'module' command, so nothing launches.

We will be fixing this in 1.7.13 so that the behavior is reverted. But the fixes in this PR should address the problem and be unaffected when 1.7.13 is released.


We have an additional issue at our center in that users and groups create their own modules and source them in their .bashrc files. The workaround provided doesn't help for this issue. Any idea when you'll be releasing version 1.7.13?

Thanks,
Dori

We have the patch in and we're building now, so tomorrow we should be able to verify everything is working as it should.

Then we'll be able to promote it to stable, so maybe tomorrow (5-28) or Friday (5-29) depending (and it's 1.7.14 now, to also get a Safari/noVNC patch).


Excellent! I can test this in our development setup too if that would help.

1.7.14 is currently built in the latest repo https://yum.osc.edu/ondemand/latest/web/el7/x86_64/

You can try this out on your development host by running:

yum install https://yum.osc.edu/ondemand/latest/ondemand-release-web-latest-1-6.noarch.rpm
yum clean all
yum update ondemand

You can undo this change by running:

yum remove ondemand-release-web-latest
yum install https://yum.osc.edu/ondemand/1.7/ondemand-release-web-1.7-1.noarch.rpm
yum clean all
yum downgrade ondemand

Thanks! This worked well for us in dev so we put it in production this afternoon. Appreciate the fast response on fixing this!

This has been promoted to stable, so you don't need to pull it off our latest repo anymore; you can get it from the regular 1.7 repo.

Sites that have implemented this fix should not have issues updating directly. There should be no issue sourcing a file a second time (if it gets sourced at all, with the if block there).

Hi Jeff,

Does this update also include the Linux Host Adapter fixes that you did for us?

Also, would you mind checking Linux Host Adapter feedback and responding to the questions I have there?

Thanks,
MC