Jupyter Notebooks not seeing GPU

Hello,

Wondering if anyone might have some ideas on this. We have Jupyter Notebooks set up at our site with the option to choose a GPU node on the initial launch page. This functionality works, and we confirm that when the Jupyter job is launched on our cluster via OnDemand, it has requested the GPU. We have also confirmed within the output.log that it is loading the appropriate modules and versions (cuda and cudnn). However, when we load the appropriate Conda environment, import tensorflow within Jupyter, and ask it to print the number of GPUs available, it always returns zero. Users are given the choice to load preconfigured Conda environments from within Jupyter, and we have chosen one with a confirmed working Tensorflow setup.

When we complete these steps as an interactive job on our cluster via our scheduler (Slurm), that is, request a GPU node with gres=gpu, load the exact same cuda/cudnn modules, load the Conda environment that has TensorFlow, and within Python run: import tensorflow as tf and print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU'))), it always “sees” the GPU(s).

We are not sure why it is not working within OnDemand/Jupyter as the preliminary steps taken seem to be the same between both. Any thoughts on this are appreciated.
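For anyone reproducing this, the check described above can be run unchanged in both places (the interactive Python shell and a notebook cell) so the two outputs can be compared side by side. This is only a sketch: the function name gpu_report and the choice of variables to print are illustrative assumptions, not part of the original setup.

```python
import os
import sys


def gpu_report():
    """Collect details that may differ between a shell and a Jupyter kernel."""
    raw_paths = os.environ.get("LD_LIBRARY_PATH", "")
    report = {
        "python": sys.executable,
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": [p for p in raw_paths.split(os.pathsep) if p],
        "tensorflow_gpus": None,  # stays None when TensorFlow cannot be imported
    }
    try:
        import tensorflow as tf
    except ImportError:
        return report
    # Same call as in the post above.
    report["tensorflow_gpus"] = len(tf.config.list_physical_devices("GPU"))
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If the interpreter path or LD_LIBRARY_PATH differs between the two runs, that narrows down where the environments diverge.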

Can you share your template/script.sh.erb or the script that loads the Conda environment?

I’m not a Jupyter expert but I wonder if you’re loading the Conda environment as a new kernel - or in any case I’d like to know the mechanics of “when we load the appropriate Conda environment, import tensorflow within Jupyter”.

My guess is that it’s some environmental issue where something is lost in the loading. Clearly the default Jupyter has access to the device, but then loses it when you load the Conda environment. So knowing the mechanics of loading that environment would be helpful, whether it’s a new kernel or updates to an existing one.
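One way to test the "something is lost in the loading" guess is to snapshot a few environment variables in the working shell and in the notebook kernel, then diff them. The sketch below assumes a hypothetical list of variables worth watching (WATCHED); the names snapshot and diff_env are made up for illustration.

```python
import json
import os

# Hypothetical selection of variables; extend it for your site.
WATCHED = ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME", "CUDA_VISIBLE_DEVICES", "CONDA_PREFIX")


def snapshot(environ=os.environ):
    """Return the watched variables as a plain dict (missing ones map to None)."""
    return {name: environ.get(name) for name in WATCHED}


def diff_env(shell_env, kernel_env):
    """Return {name: (shell_value, kernel_value)} for every variable that differs."""
    return {
        name: (shell_env.get(name), kernel_env.get(name))
        for name in sorted(set(shell_env) | set(kernel_env))
        if shell_env.get(name) != kernel_env.get(name)
    }


if __name__ == "__main__":
    # Run once in the interactive shell and once in a notebook cell,
    # save both outputs, then feed the two dicts to diff_env().
    print(json.dumps(snapshot(), indent=2))
```

A variable that is set in the shell but missing or shorter in the kernel (LD_LIBRARY_PATH is a common suspect) would support the environment theory.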

Jeff,

Sure, this is the script.sh.erb:

#!/usr/bin/env bash

# Benchmark info
echo "TIMING - Starting main script at: $(date)"

# Set working directory to home directory
cd "${HOME}"

#
# Start Jupyter Notebook Server
#

<%- unless context.modules.blank? -%>
# Purge the module environment to avoid conflicts
module purge

# Load the required modules
#module load <%= context.modules %>

# Load the required modules
module load <%= context.modules %>
module load python/3.10-ondemand-dev
module load cuda/11.6
module load cudnn/8.1.1.33
/gpfs/shared/apps_local/python/3.7-1.20/condabin/conda init
source ~/.bashrc

# List loaded modules
module list
<%- end -%>

# Benchmark info
echo "TIMING - Starting jupyter at: $(date)"

# Launch the Jupyter Notebook Server
set -x
jupyter notebook --config="${CONFIG_FILE}" <%= context.extra_jupyter_args %>

We had the modules in the form.yml originally as an optional selection but decided to put them in script.sh.erb to be sure they are loading on startup. We haven’t found a great way to do this but we make sure everyone runs conda init first so they can “see” all of the conda environments. Then on the ones for use with Jupyter/OOD, we install nb-conda-kernels and ipykernel in these conda environments so they become accessible through Jupyter.

Once this is set up, it provides users with a list of Conda environments they can choose when creating a new notebook, for instance Python [conda env: tensorflow] etc.

I’ll have to set up a kernel myself and see. Maybe @zyou knows something offhand that I don’t?

Also you say when verifying this manually - “load the exact same cuda/cudnn modules, load the Conda environment that has Tensorflow, and within Python do”.

When you say within Python, do you mean through an interactive Python shell? To be clear - this verification does not involve Jupyter, correct?

Yes that’s correct. From within an interactive Python shell on a GPU node (not involving Jupyter).

@rgas20 I assume you have Jupyter kernels for both python/3.10-ondemand-dev and the conda environment. Can you confirm that your test ran in a notebook opened using the correct kernel? If not, could you try opening a terminal in Jupyter, loading the conda environment manually, and then testing tensorflow?

Yes that’s correct. We switched to the correct kernel by opening a new notebook with that kernel and testing out the workflow. Strange because it seems like Jupyter is the only thing in the middle of this not working.

I have to test this setup a little more, but I was able to get this working by using a personal kernel.

That is, I didn’t do these steps of yours in the application itself; I just made a kernel that points at a conda environment that has tensorflow and tensorflow-gpu.

It’s not a system shared kernel/conda environment; it was one I set up in my own $HOME.

/gpfs/shared/apps_local/python/3.7-1.20/condabin/conda init
source ~/.bashrc

OK - I would suggest you take this approach.

Instead of issuing this command (I don’t know about the ~/.bashrc)

/gpfs/shared/apps_local/python/3.7-1.20/condabin/conda init

Find your kernel(s) and/or move them to some appropriate place and add that directory to the JUPYTER_PATH.

Here’s what I just tested with:

export JUPYTER_PATH="$JUPYTER_PATH:/users/PZS0714/johrstrom/ondemand/app-testing/kernels"

I just tested this similar setup. I have the conda environment in a different location than my kernels. I’ve added this directory to the JUPYTER_PATH and they show up in jupyter lab.

Note the kernels/kernels directory name - I just didn’t have a better name for the top level directory that I wanted Jupyter to search through for testing this out.

Essentially, whatever directory you add to JUPYTER_PATH should have one child directory named kernels, which in turn has a child directory for each kernel.

[johrstrom kernels()]  pwd
/users/PZS0714/johrstrom/ondemand/app-testing/kernels
[johrstrom kernels()]  ll kernels/conda_discourse-2511/
total 24
-rw-r--r-- 1 johrstrom PZS0714  327 Mar  9 13:02 kernel.json
-rw-r--r-- 1 johrstrom PZS0714 1084 Mar  9 13:02 logo-32x32.png
-rw-r--r-- 1 johrstrom PZS0714 2180 Mar  9 13:02 logo-64x64.png
-rw-r--r-- 1 johrstrom PZS0714 9605 Mar  9 13:02 logo-svg.svg
[johrstrom kernels()]  cat kernels/conda_discourse-2511/kernel.json 
{
 "argv": [
  "/users/PZS0714/johrstrom/ondemand/app-testing/conda/discourse-2511/bin/python",
  "-m",
  "ipykernel_launcher",
  "-f",
  "{connection_file}"
 ],
 "display_name": "discourse-2511 [/users/PZS0714/johrstrom/ondemand/app-testing/conda/discourse-2511]",
 "language": "python",
 "metadata": {
  "debugger": true
 }
}
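To double-check that a directory added to JUPYTER_PATH has the layout described above (a kernels child with one subdirectory per kernel, each holding a kernel.json), something like the following can list what is there. It is a rough sketch using only the standard library; find_kernelspecs is a hypothetical helper and only looks at JUPYTER_PATH entries, not at the other locations Jupyter searches.

```python
import json
import os
from pathlib import Path


def find_kernelspecs(jupyter_path):
    """Map kernel name -> display_name for each <entry>/kernels/<name>/kernel.json."""
    found = {}
    for entry in filter(None, jupyter_path.split(os.pathsep)):
        base = Path(entry)
        if not base.is_dir():
            continue  # skip entries that do not exist on this node
        for spec in sorted(base.glob("kernels/*/kernel.json")):
            with spec.open() as handle:
                display = json.load(handle).get("display_name")
            # Keep the first match for a name if it appears more than once.
            found.setdefault(spec.parent.name, display)
    return found


if __name__ == "__main__":
    print(find_kernelspecs(os.environ.get("JUPYTER_PATH", "")))
```

An empty result for a directory you expected to work usually means the kernels level is missing or misnamed.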

For each conda environment I set up, I activate the conda env and then:

python -m ipykernel install --user --name myenv --display-name "Python (which env)"

Every kernel I install shows up in Jupyter. I think that’s basically the same result of your kernel.json, Jeff, unless there’s a specific reason you did it your way?

Kenny

I set mine up the same way essentially, I just moved it out of my $HOME afterwards.

This is a way to set up a shared kernel. That is, the kernels you’ve created are in ~/.local/share/jupyter/kernels, in your own $HOME. So your Jupyter can see them, but I think the desire is to have a global kernel that all users can use.

If you set up a kernel in, say, /gpfs/shared/apps_local/jupyter/kernels, you could add /gpfs/shared/apps_local/jupyter to JUPYTER_PATH and all your users can see that shared/global kernel.
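As a sketch of what creating such a shared kernel could look like, the helper below writes a kernel.json with the same argv shape as the one shown earlier in this thread. The function name write_kernelspec and the example paths in the comment are hypothetical. The optional env block relies on the kernelspec format's env key, which sets environment variables for the kernel process; whether setting anything there is needed at your site is an assumption to verify, not something confirmed in this thread.

```python
import json
from pathlib import Path


def write_kernelspec(base_dir, name, python_exe, display_name, env=None):
    """Create <base_dir>/kernels/<name>/kernel.json and return its path."""
    spec = {
        "argv": [python_exe, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }
    if env:
        # Optional: variables Jupyter sets for the kernel process at launch.
        spec["env"] = dict(env)
    kernel_dir = Path(base_dir) / "kernels" / name
    kernel_dir.mkdir(parents=True, exist_ok=True)
    target = kernel_dir / "kernel.json"
    target.write_text(json.dumps(spec, indent=1) + "\n")
    return target


# Example call (hypothetical environment path, shown as a comment so nothing
# is written when this file is imported):
# write_kernelspec("/gpfs/shared/apps_local/jupyter", "tensorflow",
#                  "/path/to/conda/envs/tensorflow/bin/python",
#                  "Python [shared: tensorflow]")
```

With /gpfs/shared/apps_local/jupyter on JUPYTER_PATH, a kernel written this way should appear for every user who can read that directory.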

I’m not sure exactly what conda init does, or whether it achieves the same thing, but I’m wondering if it would help to remove it and share the kernel with some other method.

/gpfs/shared/apps_local/python/3.7-1.20/condabin/conda init

All, thanks so much for the suggestions on this, and for the ideas on ways to organize our Jupyter kernels. We don’t exactly know what was happening, but we think we have this fixed as of yesterday. Sometimes when we’d run a Jupyter notebook it would complain about TensorRT and a missing libnvinfer.7 library. Our version should have been using the newer library, libnvinfer.8, so one of our admins copied that library to libnvinfer.7 so the names matched, and it ended up working. I wish the solution wasn’t as cryptic, but Jupyter now sees the GPU and we’ve successfully run one or two notebooks via that environment. I’m guessing something got corrupted somewhere down the line. We’re definitely in need of more testing, but we have something in place right now while we look into this.
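For the follow-up testing mentioned above, a quick way to see which GPU libraries the dynamic loader can actually resolve from inside a notebook kernel is to try loading them with ctypes. This is a sketch only: probe_libraries is a hypothetical helper, and the sonames in the list are assumptions that should be adjusted to whatever your cuda, cudnn and TensorRT installs actually provide.

```python
import ctypes


def probe_libraries(names):
    """Return {name: True/False} depending on whether the loader can open it."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the library, or one of its dependencies, is not found.
            results[name] = False
    return results


if __name__ == "__main__":
    # Assumed sonames; edit to match your site's modules.
    wanted = ["libcudart.so", "libcudnn.so.8", "libnvinfer.so.7", "libnvinfer.so.8"]
    for lib, ok in probe_libraries(wanted).items():
        print(f"{lib}: {'found' if ok else 'MISSING'}")
```

Running it once in the interactive shell and once in the notebook shows whether the kernel is missing a library that the shell can find.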


Ahh, thanks for clarifying :) That should come in handy.

Thanks Jeff!
Kenny
