Hi everyone,
I run LSF v10 and ondemand-3.1.4-1.el8.x86_64. I can’t load a module (for this example we will use Python) and run it on the submit host. All I can do is run a basic batch script from the GUI and have it execute on the submit host (example: echo “hello world!”). I have two problems. First, the environment won’t copy over, but I can bypass this with “copy environment”, and then it works. Second, and more importantly, it won’t execute the Python script. It says it can’t find it. The reason is that it’s looking in my home directory, but in reality the script is in that really long “script location”. Does anybody have any suggestions? Someone suggested using submit.yml.erb. That’s fine, but I just want this simple thing to work universally.
Thanks,
–David
It seems like you’re using the job composer, so the suggestion about submit.yml.erb doesn’t help here: it only applies to batch connect applications.
When I’m in the job composer and click on Job Options, I can see there’s a dropdown menu for which script to run.
This is the script I’m trying to submit to the scheduler. Are you saying the python script is here in this dropdown and LSF can’t find it?
Or rather that you’re submitting a shell script that is then invoking your python script and it (the shell script) can’t find the python script?
Hi,
So let’s just assume submit.yml.erb doesn’t exist. Because really I made it and it doesn’t really work. So yes to your second question:
Or rather that you’re submitting a shell script that is then invoking your python script and it (the shell script) can’t find the python script?
Here is the example:
#!/bin/bash
#BSUB -q normal
#BSUB -J pythonjob #LSF Job Name
#BSUB -o pythonjob.%J.out #Name of the job output file
#BSUB -e pythonjob.%J.out #Name of the job error file
# -- send notification at start --
#BSUB -B
# -- send notification at completion --
#BSUB -N
module load python/3.10
python --version
python hello.py
With “copy environment” selected in the GUI, the error is:
Python 3.10.5
python: can’t open file ‘/home/david/hello.py’: [Errno 2] No such file or directory
As you can see, it did load the module, but it tried to run the Python script from my home directory (/home/david),
whereas the script location per the GUI was:
/home/david/ondemand/data/sys/myjobs/projects/default/70
–David
OK, thanks for that explanation. I can’t replicate this on Slurm. We seem to be submitting the job from the directory that contains the scripts and input.
So if I add things like this to my shell script (sequential_job.sh):
echo "CWD is: $CWD"
echo "submit dir is $SLURM_SUBMIT_DIR"
echo "pwd is $(pwd)"
I get output like this:
CWD is:
submit dir is /users/PZS0714/johrstrom/ondemand/data/sys/myjobs/projects/default/34
pwd is /users/PZS0714/johrstrom/ondemand/data/sys/myjobs/projects/default/34
So the script should start with its pwd set to the directory of that script/job.
Looking at the code for copying environments, I see we’re passing the -env flag with all the keys.
If I try to replicate on Slurm and dump my env from within the shell script, I see that my PWD is always the same. Could it be that LSF is preserving the PWD environment variable and Slurm is resetting it? (If LSF is preserving it, it’s likely /var/www/ood/apps/sys/dashboard, and since that doesn’t exist during the job’s execution it falls back to HOME?)
Oddly enough, the documentation says that PWD is not propagated to the job.
And just to be clear: when I tested with Slurm, I tested both with and without the copy environment flag and got the same PWD.
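An LSF counterpart to the Slurm check above might look like the sketch below. It assumes LSF exports LS_SUBCWD (the submission directory) alongside LS_EXECCWD into the job environment; the `:-` defaults only keep the output readable if a variable turns out to be missing. The function name print_job_dirs is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the directories a job thinks it is running in.
# LS_SUBCWD and LS_EXECCWD are assumed to be set by LSF inside a job;
# outside of LSF they simply show up as "(unset)".
print_job_dirs() {
  echo "LS_SUBCWD is: ${LS_SUBCWD:-(unset)}"
  echo "LS_EXECCWD is: ${LS_EXECCWD:-(unset)}"
  echo "PWD variable is: ${PWD:-(unset)}"
  echo "pwd is: $(pwd)"
}

print_job_dirs
```

Dropping these lines near the top of the batch script should show whether any of the variables point at the job composer's directory or whether they all point at the home directory.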
Hi,
Great idea to include env. Here is a summary of the relevant env variables:
LS_EXECCWD=/home/david/
PWD=/home/dweise
According to IBM:
- LS_EXECCWD: Sets the current working directory for job execution.
So, technically it’s working correctly: hello.py isn’t there. But how do I change it so that LS_EXECCWD is the same as the script location?
Or, in the Slurm case, so that SLURM_SUBMIT_DIR = pwd?
Where would you set that?
–David
Hi,
And I also want to point out that env made no mention whatsoever of the real submit directory, i.e. my equivalent of the SLURM_SUBMIT_DIR shown in the message above (/users/PZS0714/johrstrom/ondemand/data/sys/myjobs/projects/default/34).
–David
Yeah, I think SLURM_SUBMIT_DIR in Slurm is the same as LS_EXECCWD in LSF.
Can we remove OOD from this for a second and run a test from the command line?
What I’d like to test is providing the full path of the script versus the relative path. I’m not 100% sure how the environment fits into this, but I’m guessing it doesn’t at all, and that it’s a separate problem.
bsub /full/path/to/the_file.sh and bsub the_file.sh: I’d like to see how LSF behaves when you supply the full path (from any directory AND from the correct directory) vs. when you are in the directory and submit the job with the relative path (note there’s no ./ there).
I checked our code base and we do change directory into the correct directory when submitting the job. We are however supplying the full path of the file - so I’m thinking that’s what’s throwing LSF off. If it were the relative path maybe LSF would behave differently.
Sorry, I’m getting ahead of myself. Rereading the code (or scrutinizing it), we’re actually passing the script through stdin. So to actually replicate the behavior here, you’d need to run cat the_file.sh | bsub. We do definitely change into the correct directory before submitting; we just read the file and pass it to bsub through stdin.
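To see why the working directory matters so much with stdin, the sketch below feeds a script to bash on stdin, the same way it would be fed to bsub. It uses plain bash instead of bsub so it runs anywhere; the temporary directories stand in for the job directory and for wherever the submission actually happens. A script read from stdin has no path of its own, so a relative name like hello.py resolves against the directory the reading process was started in.

```shell
#!/bin/bash
# Stand-in for the job directory that holds the script and its input.
job_dir="$(mktemp -d)"
# Stand-in for the directory the submission actually happens from (e.g. $HOME).
other_dir="$(mktemp -d)"

touch "$job_dir/hello.py"
cat > "$job_dir/the_file.sh" <<'EOF'
if [ -f hello.py ]; then echo "found hello.py"; else echo "hello.py missing"; fi
EOF

# Started from the job directory: the relative path resolves.
from_job_dir="$(cd "$job_dir" && cat the_file.sh | bash)"
# Started from elsewhere: same script text, but hello.py is not found.
from_other_dir="$(cd "$other_dir" && cat "$job_dir/the_file.sh" | bash)"

echo "from job dir:   $from_job_dir"
echo "from other dir: $from_other_dir"

rm -rf "$job_dir" "$other_dir"
```

The first line reports "found hello.py" and the second "hello.py missing", even though the script text is identical in both runs.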
Hi,
So the OnDemand server isn’t a submit host. OnDemand submits to the login host; let’s call it vulcan. vulcan then does the submitting. When I tell OnDemand to open the directory, it does so, and I can run bsub < batchsubscript.sh there. That works, and it dumps the output into the “script location” directory.
–David
Hi,
Ok, but since the directory is dynamic, is there a variable for “script location”, so I could put in the code:
cd $SCRIPT_LOCATION
python ./hello.py
–David
Now it’s all starting to make sense. OK, if you’re sshing into another machine, then you are in fact submitting the job from your $HOME, and the fact that we cd into the correct directory before submitting (or sshing to submit) does not do anything for you.
I don’t think so, because even if we set the environment on the OOD machine - we ssh somewhere else and thus get an entirely new environment.
I’ll need a minute to contemplate what solution could work for you. I’m not entirely sure off the top of my head how to deal with this. Obviously you can set up the OOD machine to be a submit host, and I think things would start working for you there.
But if you must ssh to a submit_host, then I’ll need to think of a workaround for you.
Hi,
Well, it used to be a submit host, but when you log in it loads .bashrc, and most people in this environment have preloaded modules and so on. So the modules provided by the OnDemand RPM conflicted with our NFS-mounted modules command. A workaround would be good.
–David
I see. I’m finding now that scl-utils sets up a module system, which is unfortunate.
Hi,
So I made the OnDemand server a submit server. I finally got it to submit a job from the GUI. That being said, it still has the same problems. Actually, we have a new one:
/home/david/.lsbatch/1718224327.29911075.shell: line 10: module: command not found
Python 3.10.5
python: can’t open file ‘/home/david/hello.py’: [Errno 2] No such file or directory
BUT, now that it’s a submit server, if I just use the shell and do:
bsub < submitscript.sh
it works fine.
–David
Hi,
Do you know anybody who has this running on an LSF 10.1 system?
–David
Only by inference, from answering questions here. I’m looking at the LSF documentation and can’t find how they build environments and so on. To get modules working, yes, you could source files under /etc, but you’d have to do it in every script, every time, which is not a very good UX. So I’m looking on this site to see if anyone’s seen the same issue with batch jobs and the job composer.
Hi,
Yes, that is why we have modules on an NFS mount and it’s in our bashrc.
What about OOD_DATAROOT? And is there a variable for “script location”?
–David
There is no environment variable for script location. I’m not sure how OOD_DATAROOT can help you: if you can’t find modules in your $HOME, you won’t be able to find them in some other directory either.
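Since no such variable exists, one stopgap is to hard-code the job directory near the top of the batch script and fail early if it is wrong. This is only a sketch: JOB_DIR and enter_job_dir are names made up here, the default path is the script location quoted earlier in this thread, and it has to be edited per job because the trailing number changes.

```shell
#!/bin/bash
# Hypothetical variable: set by hand to the "script location" shown in the GUI.
JOB_DIR="${JOB_DIR:-/home/david/ondemand/data/sys/myjobs/projects/default/70}"

# Hypothetical helper: change into the job directory or report why not.
enter_job_dir() {
  if ! cd "$1" 2>/dev/null; then
    echo "cannot cd to job directory: $1" >&2
    return 1
  fi
  echo "running in $1"
}

# Only run the Python script once we are actually in the job directory.
if enter_job_dir "$JOB_DIR"; then
  python hello.py
fi
```

This is not a universal fix, since the path still has to be maintained by hand, but it turns a confusing "No such file or directory" from Python into a clear message about which directory was expected.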
Hi,
So let us take a step back. I figured out how to get modules to work: I just click “copy environment”. The real problem is that when you execute a batch script, it says it can’t find the script. Example:
python hello.py - can’t find hello.py.
Why?
Because hello.py is in the script location and not in $HOME, where LSF is looking. So we just need to tell LSF that hello.py is in the “script location”. Can we change the “script location” to $HOME? Some posts suggest that changing OOD_DATAROOT might help, but that isn’t a long-term solution.
–David