I wanted to share the RStudio Server app we use at New Mexico State University (NMSU). This post is meant as a showcase for people looking to add RStudio Server to their OOD instance, as well as for users who are having trouble getting RStudio Server working (there are a lot of posts here with RStudio errors).
One of the major additions we made to our version is ensuring that the Slurm, CUDA, TZ, and Singularity environment variables are made available to R inside RStudio Server. We do this by launching the container to grab a copy of R’s site Renviron file, making changes to that copy, then bind-mounting the updated file when we launch RStudio Server. This is required because RStudio drops any environment variables not set by an Renviron file when starting a session.
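Roughly, that flow looks like the sketch below. The container path and the Renviron location inside the image are placeholders, not the repo’s actual values, and the first step is stubbed out so the filtering step is visible:

```shell
# Sketch of the Renviron flow; paths are placeholders, not the repo's values.
WORKDIR="$(mktemp -d)"

# 1. Grab a copy of R's site Renviron from inside the container. In the
#    real app this would be something like:
#      singularity exec rstudio.sif cat /usr/lib/R/etc/Renviron > "$WORKDIR/Renviron"
#    Here we stand in a one-line placeholder file instead.
printf 'R_LIBS_SITE=/usr/lib/R/site-library\n' > "$WORKDIR/Renviron"

# 2. Append the job's Slurm/CUDA/TZ/Singularity variables, which RStudio
#    would otherwise drop when starting a session. (The two assignments
#    here just simulate what Slurm would export inside a real job.)
SLURM_JOB_ID=12345 CUDA_VISIBLE_DEVICES=0 env \
  | grep -E '^(SLURM_|CUDA_|TZ=|SINGULARITY)' >> "$WORKDIR/Renviron"

cat "$WORKDIR/Renviron"

# 3. At launch, bind-mount the updated file over the container's copy:
#      singularity exec -B "$WORKDIR/Renviron:/usr/lib/R/etc/Renviron" \
#        rstudio.sif rserver ...
```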
Another interesting thing we do is add a bind mount for RStudio’s active-sessions directory, ensuring that multiple instances of the OOD app don’t use the same RStudio session folder. This has solved quite a few issues for users who like to multitask.
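The per-job session directory amounts to something like this. The directory layout and the mount target are assumptions for illustration (RStudio keeps session state under the user’s home by default), not the repo’s exact paths:

```shell
# Give each OOD job its own RStudio session-state directory so two
# concurrent app instances never share one. Paths are placeholders.
SLURM_JOB_ID="${SLURM_JOB_ID:-12345}"   # set by Slurm inside a real job
BASE="$(mktemp -d)"                     # real app: a directory under $HOME
SESSION_DIR="$BASE/rstudio-sessions/$SLURM_JOB_ID"
mkdir -p "$SESSION_DIR"
echo "$SESSION_DIR"

# Then bind-mount it over the location RStudio uses for active sessions:
#   singularity exec -B "$SESSION_DIR:$HOME/.local/share/rstudio" \
#     rstudio.sif rserver ...
```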
There are a few other things, but go ahead and give it a look. The README also includes an example Singularity build file, which creates a usable container without needing any modifications.
If you clone/fork the repo, setup should only require modifying the form.js, form.yml, and submit.yml.erb files. The only software dependencies are Singularity and a container image (.sif/.simg).
Let me know if you have thoughts, suggestions, or questions.
Just a heads up for anyone looking at this: after extensive testing, we found that the combination of RHEL/CentOS 7 + Singularity (from EPEL) + RStudio Server causes nodes to have major performance issues and eventually become completely unresponsive. This happens when an Out Of Memory (OOM) kill event is triggered, which then hangs indefinitely trying to kill the rsession process.
Not all R code that exceeds the Slurm job’s memory limit will hang and cause the node to become unresponsive. That said, enough users have triggered this bug that it is not trivial.
I’ve tracked it to the following GitHub issue: https://github.com/apptainer/singularity/issues/5850. There doesn’t seem to be a fix other than upgrading to RHEL 8. The kernel-module workaround listed at the end of the issue looks like a stopgap, but it causes kernel crashes, at least in my case (still debugging why).
For now we are still exploring options, as we cannot upgrade to RHEL 8 for at least six months. As things stand, the options are either removing RStudio Server entirely, or making jobs exclusive (#SBATCH --exclusive=user) and setting the default resource requests as high as possible.
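For the exclusive-job route, the submit-script header would look something like this. This is a config fragment only; the memory and walltime values are placeholders, not our actual defaults (`--mem=0` is Slurm’s way of requesting all memory on the node):

```shell
#!/bin/bash
# Hypothetical stop-gap job header; values are placeholders.
#SBATCH --exclusive=user     # only this user's jobs may share the node
#SBATCH --mem=0              # request all memory on the node
#SBATCH --time=08:00:00      # placeholder walltime
```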
Just an update: the workaround at https://github.com/pja237/kp_oom has temporarily fixed this issue for us on RHEL 7 while we plan our upgrade to RHEL 8. If you are on RHEL 7, I highly recommend giving it a look.
I need some help troubleshooting problems some folks are having with Interactive App requests that get cancelled due to Slurm problems (e.g. Reason=launch_failed_requeued_held). This has been occurring frequently in the context of a course assigning homework to be completed through rserver applications.
As I understand it, the delay before Slurm can provide a node for the requested job can be managed through a ‘keep-alive’ change to the websockets. We will attempt to implement this and report back.
Relatedly, is it possible for a person with an established Passenger session to “make the situation worse” by repeatedly launching the same application? This leads to the more general question of best practices for end users to keep their Nginx/Passenger environment in good shape. What is the role of clearing cookies and logging out of OOD explicitly when troubleshooting? For example, a student yesterday evening launched the target app and experienced a “requeued” failure. They launched again a few minutes later, Slurm assigned that request to a different node where the resources were promptly available, and the app ran. That job was (it seems explicitly) cancelled a few hours later, and the student then attempted to start the app again twice within a minute, generating back-to-back Slurm job IDs, each failing with “launch_failed_requeued_held”. Automated cleanup of Passenger sessions proceeded over the next two hours (one each hour) for this student.
After the fact, I am not able to use nginx_show --user= to determine whether they had managed to create more than one session, or to understand any further what was going on in their environment. I have looked in /var/log/ondemand-nginx//error.log and seen the sequence of application launches, but am unable to glean much other information from the log.
First, any issues with Nginx/Passenger should be brought up with the OOD team, as they will be able to provide better help than I can.
Concerning the RStudio Server Interactive App itself, there are a few things to keep in mind. The only timeout built into the app covers the wait, once the job goes from pending to running in Slurm, for RStudio to start before giving up (template/after.sh · master · NMSU_HPC / ood_bc_rstudio · GitLab). Another thing to keep in mind is that RStudio Server is not requeue-friendly and should not be used in a partition that preempts and/or requeues jobs. While technically it can be stopped and restarted, any user sessions would be lost, requiring the user to log in again and restart their code manually.
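The startup wait in after.sh follows a common pattern that can be sketched like this. The function name, retry count, and probed command are all assumptions for illustration, not the repo’s actual script:

```shell
# Hedged sketch of a startup-wait loop: poll until the probe command
# succeeds or we run out of tries, then give up. Not the actual after.sh.
wait_for_up() {
  local tries="$1"; shift
  local i
  for ((i = 0; i < tries; i++)); do
    if "$@"; then return 0; fi   # probe succeeded: server is up
    sleep 1
  done
  return 1                       # timeout reached: give up
}

# In the real app the probe would be something like:
#   wait_for_up 300 curl -sfo /dev/null "http://${host}:${port}"
# Here we use a trivially-true probe so the sketch is self-contained.
wait_for_up 3 true && echo "up"
```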
One more thing to keep in mind: make sure you are not forcing RStudio to use the ‘--userns’ option, as that causes every job to extract the Singularity SIF image to a temporary directory before starting. This can very easily overload your filesystem and cause problems.
Yes I believe debugging Nginx/Passenger here is a red herring. Meaning - it’s not the issue.
Not on the OOD side. If Slurm had trouble with the first job submission, submitting more jobs may make it worse, but it seems like Slurm should be fairly stable. It has been in my experience, but clearly it’s not behaving so well for you.
OOD itself is fairly stateless. You log in, and we (OOD) see “oh, you’ve got jobs A, B and C running” (found from some files we wrote), so we query for them by making some squeue commands.
Clearing cookies and/or restarting the PUN isn’t going to change anything on the Slurm side, which seems to be where your issue lies.
But yes, as @nvonwolf indicates, RStudio may or may not like being rescheduled. I don’t know how often, if ever, we get preempted.
I found this issue from Slurm. Could it be that you’re getting launch_failed_requeued_held because of some bad nodes in your cluster? This error could be localized to just one or a few nodes.
Thanks, Nicholas. I have been puzzled that these failures do not even generate an output.log. Does this suggest that the timeout in after.sh.erb is not being reached? That launch_failed_requeued_held is cancelling the job more directly? Since the job is requeued, that often (though not always) assigns the job to another node. Does that interfere with authenticating the websocket connection?
It would, but the job never started in the first place, so there’s no connection to reset (or remake).
That said, if an interactive job does move nodes after it successfully launched, it’s up to the app itself (in this case RStudio) to save state and so on (Jupyter, as an example, may be better at checkpointing itself for this case).
At that point, connections will be lost, but the user should be able to just click the ‘connect App’ button, as it should have refreshed. That is, they can click that button and they’ll connect to the new host.
Hi, Jeff. Yes, sorry that I was unclear about the reason behind the problem. The use of this rserver app sometimes triggers the RHEL 7 ‘Singularity OOM’ bug, causing major problems for the host node. Based on your feedback (and Nicholas’s), when other students then launch new app requests (through OOD), Slurm does not yet know how messed up that node is, but it cannot succeed in allocating from that node, so the “requeued” state is reached, which seems to be terminal for the rserver app, independent of any timeout settings. It will just fail.
So if a student were aware, they could check for “Reason=launch_failed_requeued_held” and then relaunch the job request, perhaps including “--exclude=” as a flag to Slurm.
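From a shell that relaunch would look roughly like the following. The node name and script name are placeholders for illustration; OOD users would instead need the exclusion added to the submit template:

```shell
# Hypothetical relaunch excluding the suspect node (placeholder names):
sbatch --exclude=cn042 rserver_job.sh
```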
It is useful to know that we need to manage the issue fundamentally through Slurm operations. We’re trying out the workaround announced earlier in this thread (the kp_oom kernel module).
We can report back on that once we have some experience.