lbnl-nhc check_ps_userproc_lineage check will kill interactive VNC jobs

I’m posting this mostly as an FYI for other OOD site administrators who may also be running the lbnl-nhc node health check script with Slurm. If you use the check_ps_userproc_lineage check with the kill flag, it will flag several processes started by VNC as rogue and kill them. Normally the health check script doesn’t run while jobs are running, so this isn’t a problem, but I recently found out that Slurm does run the program whenever slurmd starts, such as when it restarts after an update.
It looks like Xvnc and several processes under it show up in the process table with a parent PID of 1, so this check doesn’t recognize them as having been started under Slurm.
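To illustrate why a parent PID of 1 trips a lineage check, here is a minimal Python sketch of the general idea. This is not NHC's actual implementation (NHC is written in bash), and the process table, PIDs, and the `has_ancestor` helper are all hypothetical and simplified for illustration.

```python
def has_ancestor(pid, parent_of, ancestor_pid):
    """Walk the parent chain from pid; return True if ancestor_pid is reached."""
    seen = set()  # guards against a malformed table that contains a cycle
    while pid in parent_of and pid not in seen:
        if pid == ancestor_pid:
            return True
        seen.add(pid)
        pid = parent_of[pid]
    return pid == ancestor_pid


# Hypothetical, simplified process table: pid -> parent pid
parent_of = {
    1: 0,      # init/systemd
    500: 1,    # resource manager daemon (assumed PID)
    610: 500,  # job script started under the daemon
    700: 1,    # Xvnc, showing a parent PID of 1
    710: 700,  # a process started under Xvnc
}

RM_DAEMON_PID = 500
user_pids = [610, 700, 710]

# Anything that cannot be traced back to the daemon is treated as rogue.
rogue = [p for p in user_pids if not has_ancestor(p, parent_of, RM_DAEMON_PID)]
print(rogue)
```

In this toy table the job script traces back to the daemon, while Xvnc and its child only trace back to PID 1, so both end up on the rogue list even though they belong to a legitimate job.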
I have opened a ticket with Slurm to see if they can allow us to disable running NHC at startup. For now I have just removed the kill flag from this check, so jobs don’t get killed but the processes are still logged as rogue.
If anybody knows of another way around this, I’d love to hear about it. Otherwise, just be aware of this.

Mike

Hi Michael:

Good catch! Be aware that the check will also kill a user who has ssh’ed into a compute node to monitor their job. You might consider pam_slurm.so in /etc/pam.d/sshd to keep out ssh users who don’t have a running job on the node, and to terminate active ssh sessions when the job ends.

We’ve tried lineage tracing in the past and found it broke a lot more things than it fixed. In our non-Slurm days, we did have a script that killed a process on a node if its owner had no batch jobs assigned to that node. So far we haven’t seen a need for that in Slurm, but YMMV…
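For anyone curious what the owner-based approach looks like compared with lineage tracing, here is a rough Python sketch of the selection logic only. It is not the original script; the function name, the input format, and the exempt-user list are assumptions made up for illustration, and it only reports candidates rather than killing anything.

```python
def processes_without_jobs(processes, users_with_jobs, exempt_users=("root",)):
    """Return (pid, user) pairs whose owner has no batch job on this node.

    processes:       iterable of (pid, user) pairs seen on the node
    users_with_jobs: set of users with a job assigned to the node
    exempt_users:    accounts that should never be touched (assumed list)
    """
    return [
        (pid, user)
        for pid, user in processes
        if user not in users_with_jobs and user not in exempt_users
    ]


# Hypothetical snapshot of one node
processes = [(1, "root"), (4100, "alice"), (4200, "bob"), (4300, "alice")]
users_with_jobs = {"alice"}

candidates = processes_without_jobs(processes, users_with_jobs)
print(candidates)
```

Because the decision depends only on who owns the process, a VNC session reparented to PID 1 is left alone as long as its owner still has a job on the node.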

Cheers,

Ric