Multiple portal instances pointing to one cluster

Hi,
I am looking into multiple portal instances for our cluster.
Is there an option/configuration to spin up multiple portal instances which interact with a single cluster?
Our current setup is hosted on AWS ParallelCluster. We have one OOD portal node pointing to the following:
Portal node → Login Node (used for SSH connection) → Head Node → HPC Cluster/Compute Nodes
The main problem we are encountering is that the portal instance gets bogged down when many users log in simultaneously. Its resources are exhausted (100% CPU utilization and around 70% RAM utilization), which eventually brings down the instance and leaves it unresponsive.

Thank you in advance!

-Vesna

There are no settings in OnDemand for this, but that's OK. All you need is a load balancer that supports sticky sessions: once a user is routed to a given instance, they're always routed to that same instance.
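As a rough illustration of what that looks like in practice, here is a minimal sketch of an open-source nginx front end using `ip_hash` for stickiness (pinning each client IP to one backend). The hostnames and certificate details are placeholders, not anything from your setup; HAProxy with cookie-based stickiness would work just as well.

```nginx
upstream ood_portals {
    # ip_hash pins each client IP to the same backend, giving
    # sticky sessions in open-source nginx.
    ip_hash;
    server ood-portal-1.internal:443;
    server ood-portal-2.internal:443;
}

server {
    listen 443 ssl;
    server_name ondemand.example.edu;
    # ssl_certificate / ssl_certificate_key directives omitted here

    location / {
        proxy_pass https://ood_portals;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```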


I’ll also point out you can have completely different front end nodes that people have to explicitly go to via different names. For example, at OSC we have ondemand.osc.edu which is for general client use. We also have a separate Open OnDemand instance at class.osc.edu that is customized for classroom / student use. Clients can technically log into either one since they both connect to identical resources.


What makes the sticky sessions necessary? I’ve seen in other threads that the home directory was mentioned as a limiter, but in our setup we do have a shared EFS volume (Vesna and I work on the same setup).

I’m also wondering what would happen to a user if the portal node running OOD that they’re “stuck” to is unavailable due to problems, and they get routed to a different one.


CSRF tokens. Basically, when you hit a web form (like the ones to submit batch jobs), the app makes sure that the POST request you send has the right token for that request.

So you need to ensure that users see the same instance so they’re passing the right CSRF tokens to it. Otherwise if you get a token from one instance, the other instance won’t recognize it when you pass it to submit a job and it’ll fail.

If they try to submit a job right when they get re-routed, it will likely fail on the second instance, but a retry will succeed. This seems like an edge case, though: the failover would have to happen at the exact moment the user is submitting a job. Unlikely given the timing, but possible.
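To make the token-mismatch failure mode concrete, here is a hypothetical sketch (not OOD's actual implementation) where each portal instance signs CSRF tokens with its own secret key. A token minted by instance A then fails verification on instance B, which is exactly what sticky sessions prevent.

```python
import hashlib
import hmac
import secrets

def make_token(secret: bytes, session_id: str) -> str:
    """Sign the session ID with this instance's secret to mint a CSRF token."""
    return hmac.new(secret, session_id.encode(), hashlib.sha256).hexdigest()

def verify(secret: bytes, session_id: str, token: str) -> bool:
    """Recompute the token with this instance's secret and compare."""
    return hmac.compare_digest(make_token(secret, session_id), token)

secret_a = secrets.token_bytes(32)  # instance A's per-instance key
secret_b = secrets.token_bytes(32)  # instance B's per-instance key

# The job-submission form is rendered by instance A...
token = make_token(secret_a, "user-session-123")

# ...so the POST succeeds only if it lands back on instance A.
print(verify(secret_a, "user-session-123", token))  # True: same instance
print(verify(secret_b, "user-session-123", token))  # False: routed to B
```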


Okay, that makes sense. Thanks for the speedy reply!