I guess I’m going to be the pessimistic one in this thread.
I’m not surprised at all that it isn’t solved, because in general it’s pretty much impossible, for many reasons. Not only in OOD, or because it’s HPC: even on a PC you are asking the impossible.
The state-saving functionality mentioned for containers all goes back to DMTCP, which Morgan mentioned earlier via the “Checkpoint Feature” page of the Apptainer User Guide (or possibly CRIU in the future).
But many complications arise, many of which aren’t even related to HPC or clusters. What comes to mind:
- Using a GPU or other device? No hope here unless there are some near-magical developments in DMTCP/CRIU.
- Did your application use a license? That license checkout has almost certainly expired.
- A connection to wandb or similar for logging results? Your app will now crash unless it has specific code to handle a dead connection.
- Anything multi-process (MPI, Ray, Dask, etc.)? I’m not giving you much of a chance there either.
- My nodes have >1TB of RAM. I certainly don’t want to fill up my shared filesystem with 1TB checkpoints of RAM dumps.
- You used a particular port number? Better hope it just happens to be free this time as well.
- Did you have a file handle open on the network file share? Oh no, that’s no good.
- Hope your program didn’t rely on any local temporary files or sockets either, because there is absolutely no way for anyone to track those, unless you add restoring /run/user, /tmp, and /var/run/user on top of this already impossible task.
- Hostname, tmpdir locations, and a slew of other parameters might have changed if this is a job. Better hope your application doesn’t care too much about any of those.
Even with a VM that you can snapshot, you still have to deal with all the problems listed above, because you surely want it to access the network filesystem.
Maybe further development of CRIU or DMTCP will tackle some of these in time and slightly improve the situation, but it’s probably always going to be hit-and-miss for even the most straightforward of applications, and they can never handle all of the problems I listed above. A busy port, for example, will never have a solution.
In my opinion, application-level checkpoints, where the application itself knows how to initialize all its resources and restore from a previously generated checkpoint file, are the only thing that can ever work reliably.
Signals are the standard way of handling this: SLURM sends a signal (SIGTERM) to tell the application it’s time to shut down, then sends a kill signal 30 seconds later if things still aren’t done (both the signal type and the wait period are configurable). Your application could choose to save its state upon receiving such a signal, or it could of course just save regular checkpoints.
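To make the signal idea concrete, here is a minimal sketch of what “save state on SIGTERM” could look like in a Python application. Everything in it is made up for illustration: the function names, the JSON state layout, and the toy work loop are assumptions, not anything SLURM or OOD prescribes. The handler only sets a flag, and the main loop does the actual saving at a safe point between steps.

```python
import json
import os
import signal
import tempfile

# Set by the signal handler; the work loop polls it between steps.
stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record the request, the main loop does the saving."""
    global stop_requested
    stop_requested = True


def install_handler():
    """Register the handler for SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, request_stop)


def save_checkpoint(path, state):
    """Write to a temp file in the same directory, then rename over the target,
    so a kill in the middle of the write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, total_steps):
    """Hypothetical work loop: stops early if SIGTERM arrived, then checkpoints."""
    state = {"step": 0, "total": 0}
    while state["step"] < total_steps and not stop_requested:
        state["total"] += state["step"]  # stand-in for real work
        state["step"] += 1
    save_checkpoint(path, state)
    return state
```

The save has to finish within the grace period before the kill signal arrives, so for large states regular checkpoints are probably the safer choice. In a batch script you also need to make sure the signal actually reaches the application process and not only the wrapping shell.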
So, assuming the application supports app-level suspend and resume, I suppose there could be some kind of checkpointing feature: customize an OOD app so the user can pick the file to resume from via the file browser, and have your startup shell script pass --resume %{checkpoint_file}% or whatever flags that particular application expects.
Make the application write checkpoints to some known directory when it receives SIGTERM.
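On the application side, the resume half could be as small as the sketch below. The --resume and --checkpoint-dir flag names, the default directory, and the JSON state layout are all assumptions chosen for this example; a real application would use whatever flags and file format it already has.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Command line for a hypothetical resumable application."""
    parser = argparse.ArgumentParser(description="Hypothetical resumable application")
    parser.add_argument("--resume", type=Path, default=None,
                        help="checkpoint file to restore from (omit to start fresh)")
    parser.add_argument("--checkpoint-dir", type=Path, default=Path("checkpoints"),
                        help="known directory where new checkpoints are written")
    return parser.parse_args(argv)


def initial_state(resume_path):
    """Start fresh, or rebuild state from a checkpoint the application wrote itself."""
    if resume_path is None:
        return {"step": 0, "total": 0}
    with open(resume_path) as f:
        state = json.load(f)
    # Anything that cannot live in a file (ports, licenses, connections, temp
    # files) has to be re-acquired here by the application, not restored.
    return state


if __name__ == "__main__":
    args = parse_args()
    state = initial_state(args.resume)
    print(f"starting at step {state['step']}, checkpoints go to {args.checkpoint_dir}")
```

The point of doing it this way is that the application re-initializes its own resources on startup, so a changed hostname, port, or tmpdir is handled by ordinary startup code rather than by a memory dump.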
I can’t think of a single desktop or interactive application that supports a meaningful form of this, apart from programs that just open files normally. Maybe state files in ParaView?
In my mind, checkpoint+resume is basically exclusively for batch jobs. For interactive jobs, the mantra is always “save often”.
I “solve” this problem by also letting users launch a desktop session on the login node via an OOD app (there are no time limits on those shared desktop sessions), plus the recommendation to “save often”, because RAM is never backed up, and not even a functional suspend feature (which can’t exist) would save you from a sudden power outage.