Is it possible to suspend and resume jobs?

After a few years of OnDemand running on the system I manage, users have learned to leave their sessions running as long as they can, even when they are not actively computing anything, for example if they have a session that takes a long time to set up.

I understand their needs and I want to meet them, but I also don't want to keep those nodes out of the pool while they are not being actively used.

This is (obviously) most common for the "Desktop" app, but perhaps applicable to other cases too. Maybe one could leverage some VM and actual OS-suspend code, and it might not be a big deal to deploy, but I can still think of some cases where it may fail (open file handles, network connections, etc.). Or maybe Slurm preemption set to suspend could do it.

I suspect that others may have experienced a similar dilemma, and I am wondering if OOD supports a "suspend" workflow for at least some jobs. And if not, I am still wondering if anybody has thought about it more than I have and come up with a list of reasons why this is (or isn't) a good idea.

TIA!

So OOD isn't really the gatekeeper here, the scheduler is, and that brings up something HPC clusters have historically always struggled with: checkpointing as well as preempting jobs.

Unless users write their workflows to incorporate checkpointing, there isn't really a standard way to "pause" a workflow that wasn't written for it. There have been attempts to create systems that can take a snapshot of what is in memory, kill the process, and then restart it (DMTCP out of Northeastern is the first that comes to mind), but it's not universal and, as with implementing checkpointing in-code, it requires quite a lift from the user to implement.
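For reference, the DMTCP workflow looks roughly like the sketch below. The flags are taken from DMTCP's documented CLI, but check your installed version; the application name and checkpoint directory are placeholders, and the coordinator setup varies by site.

# launch the application under DMTCP so its state can be captured later
dmtcp_launch --ckptdir /scratch/$USER/ckpt ./my_app

# from another shell (or an epilog), checkpoint everything attached to the
# coordinator, then shut it down
dmtcp_command --checkpoint
dmtcp_command --quit

# later, restart from the restart script generated in the checkpoint directory
/scratch/$USER/ckpt/dmtcp_restart_script.sh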

I won't speak for others, but I have always enforced somewhat restrictive wall times and/or a pool of resources specifically for OOD interactive jobs. I have also tended to accompany that with support for moving long-running interactive jobs to headless batch jobs, as more often than not the user's workflow is ultimately not tailored for HPC.

Ultimately, there is no generic or universal way to implement suspending a job. There are some workarounds, but in my experience they require more work and maintenance/overhead than you ever gain back in resources. That said, this is just my experience.


We also restrict job lengths and resource availability (e.g., we have some queues that allow up to 7 days, but we don't allow OOD jobs on those queues and restrict all OOD jobs to a max of 12 hours, regardless of the actual queue properties).

Some apps, like JupyterLab, have built-in timeouts, so we use those. We let users "choose" a timeout of up to 60 minutes (form: hpc_docs/ood/jupyterlab/form.yml.erb at main · SouthernMethodistUniversity/hpc_docs · GitHub; actually setting the timeout config option: hpc_docs/ood/jupyterlab/template/before.sh.erb at main · SouthernMethodistUniversity/hpc_docs · GitHub).
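For anyone who doesn't want to dig through those links, the knobs involved are Jupyter's own idle-shutdown options. A minimal sketch of what ends up being passed is below; the option names are from Jupyter Server / JupyterLab, but spellings differ between older notebook servers and newer releases, so verify against your version.

# shut the whole server down after 60 minutes of no activity, and cull idle kernels
jupyter lab \
  --ServerApp.shutdown_no_activity_timeout=3600 \
  --MappingKernelManager.cull_idle_timeout=3600 \
  --MappingKernelManager.cull_connected=True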

For other apps, we've had limited success. We put a custom script in our RStudio app, for example, that checks for idle time and kills the session. We've had mixed success with that (it seems to fail to kill jobs it should): hpc_docs/ood/rstudio/template/script.sh.erb at main · SouthernMethodistUniversity/hpc_docs · GitHub. It's not super robust or anything; it just writes a small log file in the session folder with the cumulative idle time.
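The general shape of that kind of watchdog is easy to sketch; the hard part is a reliable idle test. The loop below is a rough illustration rather than our actual script: the rsession heuristic, the threshold, and the scancel teardown are all placeholders you would need to adapt.

#!/usr/bin/env bash
# hypothetical idle watchdog for a batch-connect session
IDLE_LIMIT_MIN=${IDLE_LIMIT_MIN:-120}
idle_min=0
while sleep 60; do
  # crude idle heuristic: no rsession process means no one is connected
  if pgrep -u "$USER" -f rsession > /dev/null; then
    idle_min=0
  else
    idle_min=$((idle_min + 1))
  fi
  # keep a small cumulative-idle log in the session folder
  echo "$(date +%FT%T) cumulative_idle_min=${idle_min}" >> idle.log
  if [ "$idle_min" -ge "$IDLE_LIMIT_MIN" ]; then
    scancel "$SLURM_JOB_ID"   # tear down the whole session
    break
  fi
done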

Those two apps are our biggest offenders, but we've also tried to shut down remote desktops without a ton of success. We run ours out of a container, and one would expect you could use the OS power management settings in the container or something similar to trigger a shutdown. We haven't been sufficiently motivated to test that / get it working.

For a while, we also checked for and killed jobs whose "connect to session" button wasn't clicked within some amount of time. That was pretty unpopular with our users.

Usually we just have to explain that OnDemand is the product name, that the resources often require waiting to access, and that it is not an "on demand", always-available service.


In pre-OOD days I used to run an HPC system with VNC sessions and had a launcher script that would start interactive jobs such as MATLAB, RStudio, and Stata. I had a script that would check CPU usage from cgroup data and determine that a job had been sitting idle for a period of time. It was sensitive enough that moving a mouse cursor would keep it from flagging a job. When a job was idle it would not kill it, but would set a preempt flag and email the user. If another job requested those resources, the idle job would be terminated.
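For anyone wanting to do something similar today, the same idea can be expressed against cgroup CPU accounting. This is only a rough sketch under assumptions: the cgroup v1 slurm path shown differs between sites and Slurm versions, and on cgroup v2 you would read cpu.stat instead.

#!/usr/bin/env bash
# sample the job's cumulative CPU time twice; if it barely moved, flag the job as idle
CG="/sys/fs/cgroup/cpuacct/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}"
THRESHOLD_NS=$((5 * 1000000000))   # ~5 CPU-seconds over the sampling window

t0=$(cat "${CG}/cpuacct.usage")    # cumulative CPU time in nanoseconds
sleep 300
t1=$(cat "${CG}/cpuacct.usage")

if [ $((t1 - t0)) -lt "${THRESHOLD_NS}" ]; then
  echo "job ${SLURM_JOB_ID} looks idle; setting preempt flag and emailing the user"
  # e.g. touch a flag file here that a preemption script and mailer act on
fi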

Today, on the HPC system I work with, all nodes are capable of running OOD jobs. We use a job submit plugin to adjust priority. In the form we set a limit on the length of time a job can run, which is much shorter than the max walltime set on the partition for jobs submitted by other means.

Ultimately, the structure of our system is that users work on dedicated resources purchased by their lab or use CPU/GPU credits. A credit-based system adds an incentive to terminate jobs so that users are not wasting their allocated resources.


Yeah, exactly. I'm surprised that we are all somewhat interested in this problem, but none of us (at least among those contributing to this thread) has been motivated enough to investigate the root cause; instead we all use various workarounds which we deem "good enough" (but really aren't).

Thanks everybody anyway, and keep the suggestions coming.

I guess I'm going to be the pessimistic one in this thread. :pensive_face:

I'm not surprised at all that it isn't solved, because it's pretty much impossible in general, for many reasons. Not only in OOD, or because it's HPC; even on a PC you are asking for the impossible.

The state-saving functionality mentioned regarding containers all goes back to the DMTCP that Morgan mentioned earlier, via Checkpoint Feature — Apptainer User Guide main documentation (or possibly CRIU in the future).

But many complications arise, many of which aren't even related to HPC or clusters. What comes to mind are:

  1. Using a GPU or other device? No hope here unless there are some near-magical developments in DMTCP/CRIU.
  2. Did your application use a license? That license acquisition has almost certainly expired.
  3. A connection to wandb or similar for logging results? Your app will now crash unless it has specific code to handle a dead connection.
  4. Any multi-process things (MPI, Ray, Dask, etc.)? I'm not giving you much chance there either.
  5. My nodes have >1 TB of RAM. I certainly don't want to fill up my shared filesystem with 1 TB checkpoints of RAM dumps.
  6. You used a particular port number? Better hope it just happens to be free this time as well.
  7. Did you have a file handle open on the network file share? Oh no… that's no good.
  8. Hope your program didn't refer to any local temporary files or sockets either, because there is absolutely no way for anyone to track that, lest you add restoring /run/user, /tmp, and /var/run/user on top of this already impossible task.
  9. The hostname, tmpdir locations, and a slew of other parameters might have changed if this is a job. Better hope your application doesn't care too much about any of those.

Even with a VM that you can snapshot, you still have to deal with all the problems listed above, because you surely want it to access the network filesystem.
Maybe some of these will be tackled in time; further development of CRIU or DMTCP may slightly improve the situation, but it's probably always going to be hit-and-miss for even the most straightforward applications, and they can never handle all of the problems I listed above. A busy port, for example, will never have a solution.

In my opinion, application-level checkpoints, where the application itself knows how to initialize all its resources and restore from a previously generated checkpoint file, are the only thing that can ever reliably work.

Signals are the standard way of handling this: SLURM will send a signal (SIGTERM) to inform the application it's time to close, then after 30 seconds send a kill signal if things are still not done (the signal type and wait period are configurable). Your application could choose to save its state upon receiving such a signal, or it could of course just save regular checkpoints.
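As a concrete illustration of the signal route, a batch script can ask Slurm to deliver the signal a bit before the walltime runs out and trap it. This is only a minimal sketch: my_solver and its --checkpoint-to flag are placeholders for whatever your application actually supports.

#!/bin/bash
#SBATCH --time=04:00:00
# ask Slurm to send SIGTERM to the batch shell 120 s before the time limit
#SBATCH --signal=B:TERM@120

# forward the signal to the application so it can write its checkpoint, then wait for it
trap 'kill -TERM "$app_pid"; wait "$app_pid"' TERM

my_solver --checkpoint-to "$SLURM_SUBMIT_DIR/checkpoint.dat" &
app_pid=$!
wait "$app_pid"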


So, assuming the application can support app-level suspending and resuming, I suppose there could be some type of feature for checkpointing: customize an OOD app to allow specifying the file to resume from via the file browser, and point your startup shell scripts to --resume %{checkpoint_file}% or whatever flags that particular application expects.

Make the application write checkpoints to some known directory when it receives SIGTERM.
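As a sketch of what that could look like in the app's launch script: everything here is hypothetical, with the checkpoint_file form field, the --resume and --checkpoint-dir flags, and my_interactive_app standing in for whatever your form (e.g. OOD's path_selector widget, where available) and application actually provide.

# fragment of a hypothetical batch-connect launch script
RESUME_ARGS=""
if [ -n "${CHECKPOINT_FILE:-}" ] && [ -f "${CHECKPOINT_FILE}" ]; then
  RESUME_ARGS="--resume ${CHECKPOINT_FILE}"   # value picked in the form
fi

# the app itself writes a checkpoint into this directory when it gets SIGTERM,
# so the next session can be pointed back at it
my_interactive_app ${RESUME_ARGS} \
  --checkpoint-dir "${HOME}/ood_checkpoints"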

I can’t think of a single desktop or interactive application that supports a meaningful form of this, apart from programs that just open files normally. Maybe state files in ParaView?

In my mind, checkpoint+resume is basically exclusively for batch jobs. For interactive jobs, the mantra is always “save often”.


I "solve" this problem by also allowing users to launch a desktop session on the login node via an OOD app (and there are no time limits on those shared desktop sessions), plus the recommendation to "save often", because RAM is never backed up, and not even a functional suspend feature (which can't exist) would save you from a sudden power outage.


I see. Your list of "complications" and your comment indicate that we are thinking of slightly different things. I am not interested in solving 1-5; those are for real "jobs", where I agree with you that application-level checkpoints are the only thing with some hope of succeeding. In fact, the only things that I think are really hard are 7 and some parts of 9. And I thought (maybe incorrectly?) that a VM-level "go-to-sleep" could solve 6, 8, and maybe even the other parts of 9.

I think I will go with something like this. Not exactly on the login node (which doubles as a head node in my case, and letting too much user stuff run there could bring the whole cluster to its knees), but with a small shared queue or something like that.

Thanks everybody for your thoughts

I am not interested in solving 1-5; those are for real "jobs"

Well, I have to disagree; I don't think interactive pre/post-processing, coding, or interactive data analytics are exempt from ever needing licenses, GPUs, or plenty of RAM.
ParaView, MATLAB, STAR-View, ANSA, Ansys, Mathematica, code development, and data analytics are what my users frequently run in my interactive jobs.


A VM can only solve some of the problems, and only because it is isolated, until you break that isolation (be it PCI passthrough (GPU), Kerberos tickets, DB connections, licenses, SSH connections, or even just a network file handle). That puts some pretty harsh limitations on what you could run (in particular filesystems, which are pretty unavoidable). Heck, even a CephFS filesystem would blacklist any client that isn't responding within 5 minutes, even if you hadn't accessed any files yet (requiring a manual remount).

Some of the isolating aspects of a VM could be handled by containers as well (e.g. apptainer run --net --contain app.sif)

I don't see how one would wave away problem 5, though. Whether it's dmtcp --checkpoint or qemu savevm, the problem is still going to be there regardless of the choice of technology, plus the need for tooling to allow users to manage these checkpoints and such.

and letting too much user stuff run there

I would also recommend

# cat /etc/systemd/system/user-.slice.d/50-limits.conf
[Slice]
# 10 cores max (systemd does not allow inline comments after values)
CPUQuota=1000%
MemoryMax=50G
MemorySwapMax=1G

on all login nodes.

The way I run those desktop sessions, they are not part of SLURM; it's via LinuxHost — Open OnDemand 4.0.0 documentation ( GitHub - c3se/bc_alvis_vnc: Linux host adapter for running a VNC server ), so they are not subject to wall time, but resources are shared. It could point to any host the OOD VM is allowed to SSH into. We limit users to only one such session (though the systemd slices would still put everything into the same cgroup anyway, so they can't ever escape those limits). So, technically, this could be a VM in my Proxmox cluster if I wanted to give users very long uptimes. But at some point all machines need to be updated and rebooted; you don't want to keep security fixes hanging, even in a user's own VM.
