When does vnc_clean in clusters.d run?

Hi all!

We need to clean up the stray X11 locks and sockets after our users’ Interactive Desktop jobs end. I saw some helpful info in another topic about using the vnc_clean hook in clusters.d.

My question is: when does the vnc_clean hook run? I assumed it ran at some point after the VNC session ended, but I’m running some tests, and it looks like the hook has already run during an active VNC session.

Thanks,
Ron

If you look in job_script_content.sh you’ll find it here (this is the default), which means it attempts to clean up old sessions before it launches a new vncserver, not after the session ends.

echo "Starting VNC server..."
for i in $(seq 1 10); do
  # Clean up any old VNC sessions that weren't cleaned before
  vncserver -list | awk '/^:/{system("kill -0 "$2" 2>/dev/null || vncserver -kill "$1)}'

That said, if you look at those lock files, they’re owned by the individual users. So when I launch my job, I’m only able to clean my own old lock files.

As that other topic indicates, you’ll need root privileges to delete all of the lock files.
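For what it’s worth, you can see the scope of what an unprivileged user can touch with something like this (just an illustration; the find invocation is mine, not anything shipped with OnDemand):

# List only the X11 lock files you own. /tmp is sticky, so removing
# someone else's /tmp/.X*-lock fails with "Operation not permitted".
find /tmp -maxdepth 1 -name '.X*-lock' -user "$USER"
# A root-run sweep (prolog/epilog or cron) is needed to clear everyone's.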


Seems like you could use a Slurm prologue (if you run Slurm) to do this work - either as the user, cleaning their own lock files, or as root, cleaning all of them (though you’d have to check that whatever you’re removing isn’t currently in use).
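Roughly, a root-run prolog could look like this (an untested sketch; the path is just an example, and the important part is checking the PID stored in each lock file before removing anything):

#!/bin/bash
# Hypothetical root-run Slurm prolog (e.g. Prolog=/etc/slurm/prolog.sh in slurm.conf).
# Each /tmp/.X<n>-lock stores the PID of the X/Xvnc server that created it;
# if that process is gone, the lock and its socket are stale.
for lock in /tmp/.X*-lock; do
  [ -e "$lock" ] || continue
  pid=$(tr -d '[:space:]' < "$lock")
  if [ -n "$pid" ] && ! kill -0 "$pid" 2>/dev/null; then
    display="${lock#/tmp/.X}"   # e.g. "1-lock"
    display="${display%-lock}"  # e.g. "1"
    rm -f "$lock" "/tmp/.X11-unix/X${display}"
  fi
done
exit 0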

Thanks Jeff, it’s good to know that’s a default behavior now.

I think our issue is that a given user doesn’t land Interactive Desktop jobs on the same node often enough, so the locks from previous jobs aren’t getting cleaned up very often.

Is there a reliable way to do the cleanup at the end of a job? I understand that clean.sh doesn’t always run, which makes sense given all the ways a job can be interrupted.

Slurm has facilities for doing things after jobs complete. There are similar things for other schedulers.

https://slurm.schedmd.com/prolog_epilog.html
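For reference, those hooks get wired up in slurm.conf, roughly like this (the script paths are just examples):

# slurm.conf (example paths - adjust for your site)
Prolog=/etc/slurm/prolog.sh            # runs on the node as the SlurmdUser (usually root) before the job
Epilog=/etc/slurm/epilog.sh            # same, but after the job finishes
TaskEpilog=/etc/slurm/task_epilog.sh   # runs as the job's user after each task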

A more complex option could be using FUSE to mount a temporary file system that gets cleaned up when the process/job completes (I think it gets cleaned up - I’m not 100% sure about FUSE mounts; I’ve never actually used them, I just know they exist).

We currently do some cleanup for other purposes in the Slurm epilogue. All of our current cleanup is for job allocations, and hence runs as the slurmd user. To use vncserver -list for cleanup, it seems like we would need to do it in the parts of the epilog that run as the job’s user?

Seems like it. If the user can clean up their own lock files, then over time you’d have zero lock files that aren’t in use. They’re in /tmp after all, so a reboot of the compute node may get rid of them all. Like I said, over time, with maintenance windows restarting your compute nodes and an epilogue that cleans up what the user has created, you should be in good shape.
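If you do go the run-as-the-user route, TaskEpilog= is the hook slurmstepd runs as the job’s user, so a sketch like this (untested; the path is just an example, and the awk check mirrors what the default job script already does) would let vncserver -list clean up that user’s dead sessions:

#!/bin/bash
# Hypothetical TaskEpilog script (TaskEpilog=/etc/slurm/task_epilog.sh).
# slurmstepd runs it as the job's user, so "vncserver -list" reports that
# user's sessions; any session whose Xvnc process is gone gets -kill'd,
# the same check the default job script does before launching a new server.
if command -v vncserver >/dev/null 2>&1; then
  vncserver -list 2>/dev/null | \
    awk '/^:/{system("kill -0 "$2" 2>/dev/null || vncserver -kill "$1)}'
fi
exit 0

Keep in mind it runs after every task of every job, not just the interactive desktops, so it should stay cheap and defensive like this.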

Thanks again. We’ve decided we’ll start cleaning out the old locks a bit more frequently than we reboot the nodes, and that should work fine for us. It’s not as if /tmp fills up; we just run out of ports.

We only saw this on our instructional cluster, just now at the beginning of the semester. The other research clusters are fine. It’s probably just an unusual usage pattern from the students at the start of this semester.
