I wanted to share the RStudio Server app we use at New Mexico State University (NMSU). This post is meant as a showcase for people looking to add RStudio Server to their OOD instance, as well as for users who are having trouble getting RStudio Server working (there are a lot of posts here with RStudio errors).
One of the major additions we made to our version is ensuring that the Slurm, CUDA, TZ, and Singularity environment variables are made available to R inside RStudio Server. We do this by launching the container to grab a copy of R’s site Renviron file, making changes to that copy, then bind-mounting the updated file when we launch RStudio Server. This is required because RStudio drops any environment variables not set by an Renviron file when starting a session.
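Roughly, that flow looks like the sketch below. The container path and the Renviron location inside the image are placeholders, not the repo’s actual values, and the first step is stubbed out so the filtering step is visible:

```shell
# Sketch of the Renviron flow; paths are placeholders, not the repo's values.
WORKDIR="$(mktemp -d)"

# 1. Grab a copy of R's site Renviron from inside the container. In the
#    real app this would be something like:
#      singularity exec rstudio.sif cat /usr/lib/R/etc/Renviron > "$WORKDIR/Renviron"
#    Here we stand in a one-line placeholder file instead.
printf 'R_LIBS_SITE=/usr/lib/R/site-library\n' > "$WORKDIR/Renviron"

# 2. Append the job's Slurm/CUDA/TZ/Singularity variables, which RStudio
#    would otherwise drop when starting a session. (The two assignments
#    here just simulate what Slurm would export inside a real job.)
SLURM_JOB_ID=12345 CUDA_VISIBLE_DEVICES=0 env \
  | grep -E '^(SLURM_|CUDA_|TZ=|SINGULARITY)' >> "$WORKDIR/Renviron"

cat "$WORKDIR/Renviron"

# 3. At launch, bind-mount the updated file over the container's copy:
#      singularity exec -B "$WORKDIR/Renviron:/usr/lib/R/etc/Renviron" \
#        rstudio.sif rserver ...
```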
Another interesting thing we do is add a bind mount for RStudio’s active-sessions directory, ensuring that multiple instances of the OOD app don’t use the same RStudio session folder. This has solved quite a few issues for users who like to multitask.
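The per-job session directory amounts to something like this. The directory layout and the mount target are assumptions for illustration (RStudio keeps session state under the user’s home by default), not the repo’s exact paths:

```shell
# Give each OOD job its own RStudio session-state directory so two
# concurrent app instances never share one. Paths are placeholders.
SLURM_JOB_ID="${SLURM_JOB_ID:-12345}"   # set by Slurm inside a real job
BASE="$(mktemp -d)"                     # real app: a directory under $HOME
SESSION_DIR="$BASE/rstudio-sessions/$SLURM_JOB_ID"
mkdir -p "$SESSION_DIR"
echo "$SESSION_DIR"

# Then bind-mount it over the location RStudio uses for active sessions:
#   singularity exec -B "$SESSION_DIR:$HOME/.local/share/rstudio" \
#     rstudio.sif rserver ...
```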
There are a few other things, but go ahead and give it a look. The README also includes an example Singularity build file, which creates a usable container without needing any modifications.
If you clone/fork the repo, setup should only require modifying the form.js, form.yml, and submit.yml.erb files. The only software dependencies are Singularity and a container image (.sif/.simg).
Let me know if you have thoughts, suggestions, or questions.
Just a heads up for anyone looking at this: after extensive testing, we found that the combination of RHEL/CentOS 7 + Singularity (from EPEL) + RStudio Server causes nodes to have major performance issues and eventually become completely unresponsive. This happens when an Out Of Memory (OOM) kill event is triggered, which then hangs indefinitely trying to kill the rsession process.
Not all R code that exceeds the Slurm job’s memory limit will hang and cause the node to become unresponsive. That said, enough users have triggered this bug that it is not trivial.
I’ve tracked it to the following GitHub issue: https://github.com/apptainer/singularity/issues/5850. There doesn’t seem to be a fix other than upgrading to RHEL 8. The kernel-module workaround listed at the end of the issue looks like a stopgap, but it causes kernel crashes, at least in my case (still debugging why).
For now we are still exploring options, as we cannot upgrade to RHEL 8 for at least six months. As things stand, the options are either removing RStudio Server entirely, or making jobs exclusive (#SBATCH --exclusive=user) and setting the default resource requests as high as possible.
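For the exclusive-job route, the submit-script header would look something like this. This is a config fragment only; the memory and walltime values are placeholders, not our actual defaults (`--mem=0` is Slurm’s way of requesting all memory on the node):

```shell
#!/bin/bash
# Hypothetical stop-gap job header; values are placeholders.
#SBATCH --exclusive=user     # only this user's jobs may share the node
#SBATCH --mem=0              # request all memory on the node
#SBATCH --time=08:00:00      # placeholder walltime
```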
Just an update: the workaround at https://github.com/pja237/kp_oom has temporarily fixed this issue for us on RHEL 7 while we plan our upgrade to RHEL 8. If you are on RHEL 7, I highly recommend giving it a look.
I need some help troubleshooting problems some folks are having with Interactive App requests that get cancelled due to Slurm problems (e.g. Reason=launch_failed_requeued_held). This has been occurring frequently in the context of a course assigning homework to be completed through rserver applications.
As I understand it, the delay before Slurm can provide a node for the requested job can be managed through a ‘keep-alive’ change to the websockets. We will attempt to implement this and report back.
Relatedly, is it possible for a person with an established Passenger session to “make the situation worse” by repeatedly launching the same application? This leads to the more general question of best practices for end users to keep their Nginx/Passenger environment in good shape. What is the role of clearing cookies and logging out of OOD explicitly when troubleshooting? For example, a student yesterday evening launched the target app and experienced a “requeued” failure. They launched again a few minutes later, Slurm assigned that request to a different node where the resources were promptly available, and the app ran. That job was (it seems explicitly) cancelled a few hours later, and the student then attempted to start the app again twice within a minute, generating back-to-back Slurm job IDs, each failing with “launch_failed_requeued_held”. Automated cleanup of Passenger sessions proceeded over the next two hours (one each hour) for this student.
After the fact, I am not able to use nginx_show --user= to determine whether they had managed to create more than one session, or to understand any further what was going on in their environment. I have looked in /var/log/ondemand-nginx//error.log and seen the sequence of application launches, but am unable to glean much other information from the log.
First, any issues with Nginx/Passenger should be brought up with the OOD team, as they will be able to provide better help than I can.
Concerning the RStudio Server Interactive App itself, there are a few things to keep in mind. The only timeout built into the app covers the wait, once the job goes from pending to running in Slurm, for RStudio to start before giving up (template/after.sh · master · NMSU_HPC / ood_bc_rstudio · GitLab). Another thing to keep in mind is that RStudio Server is not requeue-friendly and should not be used in a partition that preempts and/or requeues jobs. While technically it can be stopped and restarted, any user sessions would be lost, requiring the user to log in again and restart their code manually.
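The startup wait in after.sh follows a common pattern that can be sketched like this. The function name, retry count, and probed command are all assumptions for illustration, not the repo’s actual script:

```shell
# Hedged sketch of a startup-wait loop: poll until the probe command
# succeeds or we run out of tries, then give up. Not the actual after.sh.
wait_for_up() {
  local tries="$1"; shift
  local i
  for ((i = 0; i < tries; i++)); do
    if "$@"; then return 0; fi   # probe succeeded: server is up
    sleep 1
  done
  return 1                       # timeout reached: give up
}

# In the real app the probe would be something like:
#   wait_for_up 300 curl -sfo /dev/null "http://${host}:${port}"
# Here we use a trivially-true probe so the sketch is self-contained.
wait_for_up 3 true && echo "up"
```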
One more thing to keep in mind: make sure you are not forcing RStudio to use the ‘--userns’ option, as that causes every job to extract the Singularity SIF image to a temporary directory before starting. This can very easily overload your filesystem and cause problems.
Yes I believe debugging Nginx/Passenger here is a red herring. Meaning - it’s not the issue.
Not on the OOD side. If Slurm had trouble with the first job submission, submitting more jobs may make it worse, but it seems like Slurm should be fairly stable. It has been in my experience, but clearly it’s not behaving so well for you.
OOD itself is fairly stateless. You log in, and we (OOD) see “oh, you’ve got jobs A, B and C running” (found from some files we wrote), so we query for them by making some squeue commands.
Clearing cookies and/or restarting the PUN isn’t going to change anything on the Slurm side, which seems to be where your issue lies.
But yes, as @nvonwolf indicates, RStudio may or may not like being rescheduled. I don’t know how often, if ever, we get preempted.
I found this issue from Slurm. Could it be that you’re getting launch_failed_requeued_held because of some bad nodes in your cluster? This error could be localized to just one or a few nodes.
Thanks, Nicholas. I have been puzzled that these failures do not even generate an output.log. Does this suggest that the timeout in after.sh.erb is not being reached? That launch_failed_requeued_held is cancelling the job more directly? Since the job is requeued, that often (though not always) assigns the job to another node. Does that interfere with authenticating the websocket connection?
It would, but the job never started in the first place, so there’s no connection to reset (or remake).
That said, if an interactive job does move nodes after it successfully launched, it’s up to the app itself (in this case RStudio) to save state and so on (Jupyter, as an example, may be better at checkpointing itself for this case).
At that point, connections will be lost, but the user should be able to just click the ‘connect App’ button, as it should have refreshed. That is, they can click that button and they’ll connect to the new host.
Hi, Jeff. Yes, sorry that I was unclear about the reason behind the problem. The use of this rserver app sometimes triggers the RHEL 7 ‘Singularity OOM’ bug, causing major problems for the host node. Based on your feedback (and Nicholas’s), when other students then launch new app requests (through OOD), Slurm does not yet know how messed up that node is, but it cannot succeed in allocating from that node, so the “requeued” state is reached, which seems to be terminal for the rserver app, independent of any timeout settings. It will just fail.
So if a student were aware, they could check for “Reason=launch_failed_requeued_held” and then relaunch the job request, perhaps including “--exclude=” as a flag to Slurm.
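From a shell that relaunch would look roughly like the following. The node name and script name are placeholders for illustration; OOD users would instead need the exclusion added to the submit template:

```shell
# Hypothetical relaunch excluding the suspect node (placeholder names):
sbatch --exclude=cn042 rserver_job.sh
```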
It is useful to know that we need to manage the issue fundamentally through Slurm operations. We’re trying out the workaround announced earlier in this thread (the kp_oom kernel module).
We can report back on that once we have some experience.