We’ve been running OOD for a couple of years, and until now we’ve pursued a policy that limits the resources interactive apps can use. We do this 1) to avoid wasting resources (e.g., launching a 64-core RStudio job when your R code is inherently serial); and 2) to nudge people toward submitting non-interactive jobs when they have larger resource requirements. This stems from the idea that new users will start with Open OnDemand and slowly transition to traditional HPC once they need to run larger computations with more resources.
However, we’re finding that there are many users who would be content never to move past executing their Python code one cell at a time in a Jupyter notebook. And that’s fine – I don’t want to cast aspersions. But some users, if given the option, would launch 64-core, 256 GB Jupyter notebooks with 2 GPUs and leave them running for a week; and more often than not, these users aren’t doing any due diligence to make sure their code is actually using all those resources.
I’d like to know how different centers handle this problem. In particular:
Do you give preference to non-interactive (i.e. “traditional”) jobs over interactive jobs, or do you treat every job equally?
Do you limit the amount of resources interactive apps can use, or do you allow interactive apps to use up to the limit on your queues/machines?
Do you hard-code resource requests based on what is generally sufficient for each app, to minimize the number of choices/clicks, or do you leave it open-ended?
You could hard-code an hour-long time limit and hide the form field that lets the user select the runtime, I guess. On the other paw, it depends on if/how you limit CLI batch users. If each user has a max of 1,000 core-hours/month (random value), then a 64-core job will burn through that in less than a day, and the user won’t be able to work until the next month’s allocation kicks in.
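For what it’s worth, pinning the walltime in a Batch Connect app’s form.yml might look something like the sketch below. This assumes the app uses the standard `bc_num_hours` attribute; the `hidden_field` widget keeps the preset out of the user’s reach (check the OOD docs for your version before relying on it):

```yaml
# form.yml -- sketch: fix the walltime and hide the field from users
attributes:
  bc_num_hours:
    # Preset value submitted with every job; users cannot change it
    value: 1
    widget: "hidden_field"
form:
  - bc_num_hours
```

The attribute still gets submitted with the job, so the submit script sees a one-hour walltime no matter what.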
We allocate each PI a number of hours each month for their group to use. If they sit in OOD all month staring at a Jupyter window, that’s OK – it’s not like they’re inducing load on the OOD server.
We offer preset resources in different “instance” sizes, along with some common-sense values for specific apps. For instance, many of our users will run machine learning or parallel code in a Jupyter Notebook, so we offer a small, medium, large, or GPU instance for that app. On the other hand, we have apps such as PyMOL where we offer a single preset (not visible to the user), since it does not require many resources. We do the latter for many of the GUI apps.
We found that some users will still choose the largest instance (the problem you pointed out). To alleviate that issue, we picked values that worked for us (not too restrictive, but not wasteful). Still, when we implemented the drop-down menu of instance size choices, we found many folks picked reasonable sizes for their workload, and enjoyed having fewer options to enter in the interface.
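A minimal sketch of that kind of preset menu in a form.yml, using OOD’s `select` widget (the attribute name, labels, and sizes here are made up for illustration; your submit.yml.erb would map the selected value to the actual scheduler flags):

```yaml
# form.yml -- sketch of an "instance size" drop-down with preset resources
attributes:
  instance_size:
    widget: select
    label: "Instance size"
    options:
      # [displayed label, value passed to the submit script]
      - ["Small (4 cores, 16 GB)", "small"]
      - ["Medium (8 cores, 32 GB)", "medium"]
      - ["Large (16 cores, 64 GB)", "large"]
      - ["GPU (8 cores, 64 GB, 1 GPU)", "gpu"]
form:
  - instance_size
```

The submit script can then branch on `instance_size` to set the cores, memory, and GPU request, so users never type raw resource numbers at all.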
We do not offer QOS preference for traditional jobs over interactive apps. However, with the limits we have in place, there is an inherent incentive to run larger workloads as traditional batch jobs.