Question:
Can the interactive app feature handle job requeuing in Slurm? For example, a backfill partition where jobs are commonly stopped, started, and requeued during their lifetime. It is important to note that when jobs start again in this manner, they often end up running on different nodes. I haven’t tested this, but I’m not sure the interactive app cards currently handle this edge case.
Use Case:
We have a user who needs to run long-running MATLAB simulations and wants to be able to occasionally use OOD to check the visualizations and such while they are running. This is less about doing things interactively and more about managing jobs that require a desktop environment. While this use case could be handled differently, it does present an interesting edge case, and handling it could be very useful more generally.
Workaround:
For now I’ve come up with a workaround: a simple SBATCH script that does everything OOD does when launching a desktop and then calls the user’s script that runs their code. This is almost identical to the workflow in the examples for the MATLAB interactive app (launch desktop, then launch app).
The user then logs into OOD and manually edits the node/rnode parameter in the URL to reach the node and VNC port. Every time the job is requeued (stopped and restarted), the submission script writes the new node and port to a file the user can check.
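To make the "write the new node and port to a file" step concrete, here is a minimal sketch of a helper the submission script could call on every (re)start. It is not my actual script: the function name `record_connection`, the JSON layout, and the file path are made up for illustration. It assumes the standard Slurm environment variables `SLURM_JOB_ID`, `SLURM_RESTART_COUNT`, and `SLURMD_NODENAME` are set inside the batch job, and that the VNC port is already known by the time it is called.

```python
import json
import os
import socket
import time
from pathlib import Path


def record_connection(path, vnc_port, env=None):
    """Write the current node and VNC port to `path` (hypothetical layout).

    Meant to run each time the batch script starts, so the file always
    reflects the node the job landed on after the latest requeue.
    """
    env = os.environ if env is None else env
    info = {
        "job_id": env.get("SLURM_JOB_ID"),
        # Slurm increments this each time the job is requeued/restarted.
        "restart_count": int(env.get("SLURM_RESTART_COUNT", "0")),
        # Fall back to the hostname if the Slurm variable is missing.
        "node": env.get("SLURMD_NODENAME") or socket.gethostname(),
        "vnc_port": int(vnc_port),
        "written_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    target = Path(path)
    tmp = target.with_name(target.name + ".tmp")
    # Write to a temp file and rename, so a reader never sees a partial file.
    tmp.write_text(json.dumps(info, indent=2) + "\n")
    os.replace(tmp, target)
    return info
```

Called from the batch script as something like `record_connection(os.path.expanduser("~/desktop_job.json"), port)`, the user only has to read one file to find the current node and port after each requeue.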
Feature Request:
Add a flag, and the supporting functionality, to interactive apps so that job cards can handle job requeuing, including changes to node, port, and current job state. The card should show when the job has been stopped and is pending a restart.
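As a rough illustration of the state a card would need to track, the sketch below turns one line of `squeue -h -j <jobid> -o "%T|%N|%r"` output (state, node list, reason) into the fields a card might display. This is only my guess at the shape of the logic, not how OOD implements its cards; the function name `card_status` and the returned keys are hypothetical, and it assumes a single-node desktop job.

```python
def card_status(squeue_line):
    """Map one `squeue -h -o "%T|%N|%r"` line to hypothetical card fields."""
    fields = squeue_line.strip().split("|")
    # Pad so a short or empty line still unpacks into three fields.
    fields += [""] * (3 - len(fields))
    state, nodes, reason = fields[:3]
    # Only offer a connect link while the job is running on a known node.
    running = state == "RUNNING" and bool(nodes)
    return {
        "state": state or "UNKNOWN",
        "node": nodes if running else None,
        "reason": reason or None,
        "connectable": running,
    }
```

With something like this, a requeued job waiting to restart would show as `PENDING` with no node and no connect link, and the link would reappear with the new node once the job is running again.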