While testing OOD 1.7, I noticed that the interactive apps seem to be starting with a limited environment, which is different from the behavior in 1.6.
In our case, we source Lmod through a series of profile.d-like scripts that are called from /etc/profile.d, shown below. In 1.6 the module.sh would get sourced, while in 1.7 it does not. We do have a condition below that these files are not sourced for UID < 500, so is there any chance that the interactive app sessions start as a non-user? I don't see why or how that could happen, but I just want to make sure.
Or, any other thoughts?
Below is the profile.d file structure:
$ cat /etc/profile.d/chpc.sh
if [[ $UID -ge 500 ]]
then
    if [ -f /uufs/chpc.utah.edu/sys/etc/chpc.sh ]
    then
        source /uufs/chpc.utah.edu/sys/etc/chpc.sh
    fi
fi
$ cat /uufs/chpc.utah.edu/sys/etc/chpc.sh
for i in /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh; do
    if [ -r "$i" ]; then
        if [ "$PS1" ]; then
            . "$i"
        else
            . "$i" >/dev/null 2>&1
        fi
    fi
done
$ ls /uufs/chpc.utah.edu/sys/etc/profile.d/*.sh
/uufs/chpc.utah.edu/sys/etc/profile.d/module.sh
(module.sh is a somewhat longer file that sets up the appropriate Lmod).
I should add that we do have a workaround for this: adding the following to template/script.sh.erb:
if [ -z "$LMOD_VERSION" ]; then
    source /etc/profile.d/chpc.sh
fi
You should use script_wrapper or header in your cluster config instead of editing the template scripts (script_wrapper needs the %s because it wraps the script, so you can put things above or below it; see the sketch after the config below).
batch_connect:
  vnc:
    header: "#!/bin/bash"
    script_wrapper: |
      if [ -z "$LMOD_VERSION" ]; then
        source /etc/profile.d/chpc.sh
      fi
      %s
  basic:
    # same result as above
    header: |
      #!/bin/bash
      if [ -z "$LMOD_VERSION" ]; then
        source /etc/profile.d/chpc.sh
      fi
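For context on how %s works, here is a rough sketch of what the rendered batch script body ends up looking like with the vnc script_wrapper above (illustrative only, not literal OOD output):
#!/bin/bash
if [ -z "$LMOD_VERSION" ]; then
  source /etc/profile.d/chpc.sh
fi
# ... the app's generated script content is substituted in place of %s here ...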
As to the difference between 1.6 and 1.7, I can't say off the top of my head why they'd be different.
Actually, @mcuma, I can think of what's different now, especially as it relates to environment variables. We added a copy_environment option for all the schedulers.
In, say, SLURM it's --export. Do you set and/or use the job_environment map? (I found some Utah docs that seem to indicate you use SLURM.)
The behaviour now is: if you don't use job_environment, it won't use the --export flag, which is what it did in 1.6.
If you do use job_environment, then it'll do --export=NONE,FOO,BAR if you don't set copy_environment, and --export=ALL,FOO,BAR if you do set copy_environment to true (I know that sentence has a lot of ifs in it and may be convoluted).
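To make that concrete, here is a minimal sketch of a batch-connect app's submit.yml.erb (FOO and BAR are placeholder variables, and the comments just restate the behavior described above):
script:
  job_environment:
    FOO: "foo"
    BAR: "bar"
  # with copy_environment set to true, the submit would use --export=ALL,FOO,BAR
  copy_environment: true
  # omit copy_environment (or leave it false) and the submit would use --export=NONE,FOO,BAR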
This is the only thing I can think of. The sessions cannot start as a non-user; OOD submits the job through Slurm as the UID of the given user (unless you have some wrapper script in front of srun).
I think you may be on the right track, since the environment seems to be passed, but the Lmod "module" command is an alias, which probably does not pass. Though the job should be opening a new terminal on the compute node and sourcing Lmod, which is what a standard SLURM job does, and there the alias is functional. So it still looks like the /etc/profile.d/… part is not being sourced.
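One quick way to check what the session actually gets (a debugging sketch to run inside the interactive app's shell, or to drop temporarily into the batch script):
type module 2>/dev/null || echo "module is not defined in this shell"
echo "LMOD_VERSION=${LMOD_VERSION:-unset}"
shopt -q login_shell && echo "login shell" || echo "non-login shell"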
I don't think we do anything with the job environment, and the default should be --export=ALL.
Is there any documentation on the job_environment?
Well, there's clearly something awry here. Looking over the code again, the default SLURM behavior for job environments should be the same. That is, the commands executed should be the same as before, with the same environment.
Looking at SLURM help tickets I see things like this, and the SLURM FAQ says something similar: the user environment is re-populated from a copy of the environment taken when the job was submitted through sbatch, with the SLURM_* environment variables added to it.
So somehow we are corrupting the environment when we run srun.
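A minimal way to see that difference outside of OOD (a sketch assuming you can submit to Slurm directly; MYVAR is just a placeholder variable):
export MYVAR=hello
# default: the submission environment is copied into the job, so MYVAR is set
sbatch --wrap 'echo "MYVAR=$MYVAR"'
# with --export=NONE the submission environment is not copied, so MYVAR comes up unset
sbatch --export=NONE --wrap 'echo "MYVAR=$MYVAR"'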
@jeff.ohrstrom Yes, we are using Slurm. Besides sourcing the module environment setup .sh script in the cluster config, I later found that I needed to add it to the Job Composer templates as well.
We used to set SBATCH_EXPORT=NONE, which makes it unclear to me how Slurm used to find this function definition. We now just use the --export argument, but that's only if there are other job_environment variables.
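For reference, my understanding (worth verifying against your Slurm version) is that the environment variable and the flag are equivalent:
# these two should behave the same with respect to what gets exported into the job
SBATCH_EXPORT=NONE sbatch job.sh
sbatch --export=NONE job.sh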
There's a lot of talk in the tickets that generated this change, but it's likely we'll have to patch this, as it looks like the previous behavior (however it happened) is expected and indeed a much better experience.
Glad I found this thread. We updated to 1.7.11 in production today and found that it broke all our Jupyter apps. I had forgotten to test this in dev. It doesn't know the "module" command, so nothing launches.
We will be fixing this in 1.7.13 so that the behavior is reverted. But the fixes in this PR should address the problem and be unaffected when 1.7.13 is released.
We have an additional issue at our center in that users and groups create their own modules and source them in their .bashrc files. The workaround provided doesn't help with this issue. Any idea when you'll be releasing version 1.7.13?
We have the patch in and we're building now, so tomorrow we should be able to verify everything is working as it should.
Then we'll be able to promote it to stable, so maybe tomorrow (5-28) or Friday (5-29), depending (and it's 1.7.14 now, to also pick up a Safari/noVNC patch).
This has been promoted to stable, so you don't need to pull it from our latest repo anymore; you can get it from the regular 1.7 repo.
Sites that have implemented this fix should not have issues updating directly. There should be no problem sourcing a file a second time (if it even gets sourced at all, given the if block there).