I’m using Docker to launch NVIDIA containers to run Jupyter. Unfortunately, the containers are not stopped when the job completes, so the resources they hold are never released. I’m thinking I could handle this by generating a random string in before.sh.erb, saving it as an env var, using it as the container name in the docker run command in script.sh.erb, and then cleaning up in after.sh.erb. Does this sound about right?
A related question is, am I reinventing the wheel here? I would think this would be a common need. Especially since DeepOps uses OOD. Makes me wonder if I’m overlooking a prior solution to this…
after.sh.erb doesn’t run after the job, it runs after the
As to the actual issue you’re facing, I’m not sure what’s going on. What scheduler are you using? It seems like the scheduler should know that the container was launched as a part of the job (or does indeed know because it continues to withhold resources) and do what’s required to stop it when the job stops.
Which is to say, OOD doesn’t play much of a role at this point. Once we schedule the job it’s just a script running in the scheduler’s hands. I’m wondering if there’s some misconfiguration on your scheduler’s side that allows this. Can you replicate similar behavior from the command line? Like running a container that sleeps for 7200 seconds in a job that only has a walltime of 3600?
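A command-line reproduction along those lines might look like the sketch below. It is only an illustration: the busybox image, the job name, and the helper names submit_repro and leftover_containers are assumptions, and SBATCH/DOCKER are overridable variables added purely so the commands can be dry-run (for example with echo) on a machine without Slurm or Docker.

```shell
#!/usr/bin/env bash
# Sketch only: reproduce "container outlives the job" without OOD involved.
# SBATCH and DOCKER can be overridden (e.g. SBATCH=echo) for a dry run.
SBATCH="${SBATCH:-sbatch}"
DOCKER="${DOCKER:-docker}"

# Submit a job whose walltime (in minutes) is shorter than the container's
# sleep (in seconds). The busybox image is an assumption; any small image
# that provides "sleep" would do.
submit_repro() {
  local walltime_min="$1" sleep_sec="$2" name="$3"
  "$SBATCH" --time="$walltime_min" --job-name="$name" \
    --wrap="docker run --rm --name $name busybox sleep $sleep_sec"
}

# Once Slurm reports the job as finished, list any container with that
# name that is still running on the compute node.
leftover_containers() {
  "$DOCKER" ps --filter "name=$1" --format '{{.Names}}'
}
```

For the numbers above, that would be `submit_repro 60 7200 oodtest`, followed by `leftover_containers oodtest` on the node after the walltime has expired. If the name is still listed, the behavior is independent of OOD.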
This is Slurm. My script runs this:
docker run --rm --mount type=bind,source=/home/myser/ondemand/data/sys/dashboard/batch_connect/dev/jupyter-docker/output/63a4aa68-7d37-4a5d-8100-b2b146a497c6/config.py,target=/config.py --mount type=bind,source=/home/myuser,target=/home/myuser -p 31636:31636 nvcr.io/nvidia/tensorflow:20.12-tf2-py3 jupyter lab --allow-root --config=/config.py
That should be running the container in the foreground (versus -d which would make it detached). Any thoughts as to why the container keeps running after the job completes?
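One likely factor is that docker run is only a client: the container itself is started by the Docker daemon, so killing the client does not necessarily stop the container. The naming-and-cleanup idea from the first post could be sketched inside script.sh.erb roughly as follows. This is a minimal sketch, not a tested solution: the helper names are hypothetical, using SLURM_JOB_ID for the name is an assumption (a random string would work the same way), and DOCKER is an overridable variable added only so the logic can be dry-run.

```shell
#!/usr/bin/env bash
# Sketch only. DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

# Unique, predictable name: Slurm job ID when available, else this shell's PID.
container_name() {
  echo "jupyter-${SLURM_JOB_ID:-$$}"
}

# Force-remove the named container; harmless if it is already gone.
cleanup_container() {
  "$DOCKER" rm -f "$1" >/dev/null 2>&1 || true
}

# Run the container under that name and remove it when the script exits
# or receives SIGTERM/SIGINT. SIGKILL cannot be trapped, so this is
# best-effort only.
run_container() {
  local name
  name="$(container_name)"
  # Double quotes: expand $name now, while the local variable is in scope.
  trap "cleanup_container '$name'" EXIT TERM INT
  # Background + wait lets bash run the trap as soon as a signal arrives,
  # instead of deferring it until the docker client exits.
  "$DOCKER" run --rm --name "$name" "$@" &
  wait "$!"
}
```

The existing command would then become `run_container --mount ... -p 31636:31636 nvcr.io/nvidia/tensorflow:20.12-tf2-py3 jupyter lab --allow-root --config=/config.py`, with the mounts unchanged. Because the trap lives in the job script itself, it does not rely on after.sh.erb.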
My guess is it’s not a part of the job’s cgroup somehow. We don’t have docker at our site, but on my machine the default cgroup seems to be systemd’s. Maybe try docker run with --cgroup-parent=/proc/self/cgroup. Seems like you’d also need slurm.conf configured with
Also this could be a reason to avoid docker and maybe try podman? Glancing at the Slurm docs for containers, apparently even Nvidia has a container runtime, enroot. Though I’ve only heard about it now and can’t really say much about it. But I can speak to podman’s usefulness a lot. Completely unprivileged drop-in replacement for docker so you don’t have to worry about stuff like this (let alone the security!).
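Going back to the cgroup idea: /proc/self/cgroup is a file that lists the calling process’s cgroups rather than a cgroup path itself, so in practice the path would probably have to be read out of it first. The sketch below shows one way to do that. The function name cgroup_path is hypothetical, the parsing assumes cgroup paths contain no colons, and whether Docker accepts the resulting value depends on its cgroup driver (cgroupfs and systemd treat --cgroup-parent differently), so treat this as untested.

```shell
#!/usr/bin/env bash
# Sketch only: extract a cgroup path from /proc/self/cgroup, whose lines
# have the form "hierarchy-ID:controller-list:cgroup-path".
#
# $1: controller name for cgroup v1 (e.g. memory); empty selects the
#     cgroup v2 unified line, which looks like "0::/some/path".
# $2: file to read, defaulting to /proc/self/cgroup.
cgroup_path() {
  local controller="${1:-}" file="${2:-/proc/self/cgroup}"
  awk -F: -v c="$controller" '
    # cgroup v2: the controller list is empty
    c == "" && $2 == "" { print $3; exit }
    # cgroup v1: look for the requested controller in the comma-separated list
    c != "" {
      n = split($2, names, ",")
      for (i = 1; i <= n; i++) if (names[i] == c) { print $3; exit }
    }' "$file"
}
```

Inside the job script this would be used as `docker run --cgroup-parent="$(cgroup_path)" ...` on a cgroup v2 node, or `--cgroup-parent="$(cgroup_path memory)"` on cgroup v1.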
We looked at Podman, but it seemed like it would require us to manage ranges of subuids/subgids for onboarded users, which isn’t desirable.
Trying to run straight Docker looks too problematic to consider:
Will take a look at enroot…
Also worth mentioning: it works fine with Singularity-converted containers. I was just hoping to avoid the conversion step. But aside from that, it seems like a pretty clean solution.
Thanks for the pointers…