I am running some small tests, deploying OnDemand to an EC2 instance. The installation went well and I have a localhost cluster that I can log in to via the “Clusters” dropdown as a user.
Cluster specification (permissions are -rw-r--r--. 1 root root 196 Jan 16 18:32 linux_host.yml)
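The file follows the standard linux_host adapter shape — roughly along these lines (illustrative sketch only, not my exact file; hostnames and options are placeholders):

```yaml
# clusters.d/linux_host.yml — illustrative sketch, not the exact file
v2:
  metadata:
    title: "Localhost"
  login:
    host: "localhost"
  job:
    adapter: "linux_host"
    submit_host: "localhost"
    ssh_hosts:
      - localhost
    site_timeout: 7200
    debug: true
    strict_host_checking: false
    tmux_bin: /usr/bin/tmux
```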
When I attempt to launch the app, it fails with “This app requires clusters that do not exist or you do not have access to.” I can confirm that the ‘jlaura’ user can ls and cat down the /var/www/ood/apps/dev/jlaura/ path. I also ran yamllint on all of the files to check for syntax issues. Finally, I can access the cluster via the “Clusters” dropdown. Are there any other places I should be looking?
I have to admit the linux_host adapter is super difficult to debug. I see you already have the debug flag set. When it’s set, you’ll get two shell scripts (in your HOME directory, I believe, though I’m not 100% sure) that you can use to replicate the submission.
You may be hitting this spot. The other spot I see start_with? in that file has a protection against nil, so I doubt that’s the error.
You seem to have some issue parsing the script.sh.erb?
This may be your issue: passing <% script %> here to the wrapper. When you navigate to the job’s directory (in ~/ondemand/data/sys/dashboard/batch_connect/....), what does script.sh.erb look like? Is it empty?
Can you tell me what you’re trying to do here, either by allowing the user to specify script (I don’t see it in the form) or you have the variable script defined somewhere?
@jeff.ohrstrom Thanks for the assist. I was able to get the linux_host adapter working late yesterday, and I can get the script to fire. I had to update to the following:
Ultimately, we are trying to have a consistent OnDemand UI where a user can spin up an ephemeral AWS cluster (using HTCondor or AWS ParallelCluster) and get a shell. The first step here was testing whether we could use Open OnDemand and still access the AWS CLI. That is a success, and it means that we should be able to provision the ephemeral resources. The next step is to learn how we might (or might not) be able to get a remote shell on the ephemeral head node from within the browser / OnDemand ecosystem. Any thoughts on that?
Sounds like you want a login only cluster? I.e., you can’t schedule jobs on it (no need for a batch connect application), but it’ll appear in the Clusters menu to shell into.
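A login-only cluster file is just the v2 metadata plus a login block, with no job section — something like this (hostname is a placeholder):

```yaml
# clusters.d/ephemeral.yml — login-only cluster (placeholder hostname)
v2:
  metadata:
    title: "Ephemeral AWS Cluster"
  login:
    host: "head-node.example.com"
```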
Yes, I think so. Is there a way to dynamically populate that list for a specific user, without having to create the cluster.yml and then restart the service?
As an example, I tried modifying the template/script.sh to the following (this is super janky):
#!/bin/bash
# Define the remote machine and user
REMOTE_MACHINE="localhost"
REMOTE_USER="${USER}" # Use the current user
# Start the SSH session and execute bash
echo "Connecting to ${REMOTE_MACHINE}..."
ssh -tt "${REMOTE_USER}@${REMOTE_MACHINE}" <<EOF
echo "You are now on ${REMOTE_MACHINE}."
exec bash
EOF
Which keeps the app running and I get the nice “Connect to AppName” button in the OnDemand UI. The URL for that button 404s. Is there a way to set that URL in the submission script?
I think this is covered in the app setup here and the follow-on page about setting up the reverse proxy.
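Roughly, the relevant ood_portal.yml settings are the reverse-proxy ones (the host_regex value below is a placeholder you’d scope to your own compute hosts):

```yaml
# /etc/ood/config/ood_portal.yml — reverse proxy fragment (placeholder regex)
host_regex: '[\w.-]+\.example\.com'
node_uri: '/node'
rnode_uri: '/rnode'
```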
Dynamically create? Maybe? Avoid the cluster.yml file, no.
Maybe pun_pre_hook_root_cmd? It’s a hook you can run as root before the PUN starts up. But note that any files you supply to the clusters.d directory are available to all users, so you’d have to chown & chmod the file so it’s only visible to that user.
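As a rough sketch of the helper that hook could call (the function name, file layout, and the assumption that you can resolve the user’s head-node hostname are all mine, not from OOD):

```shell
#!/usr/bin/env bash
# Hypothetical helper for a pun_pre_hook_root_cmd script (runs as root before
# a user's PUN starts). How the hook receives the username varies by OOD
# version, so check your install's docs for the actual invocation.
set -euo pipefail

# Write a per-user, login-only cluster file and restrict it to that user.
# Args: username, head-node hostname, clusters.d directory.
generate_cluster_file() {
  local user="$1" head_node="$2" dir="$3"
  local file="${dir}/ephemeral_${user}.yml"
  cat > "$file" <<EOF
v2:
  metadata:
    title: "Ephemeral cluster (${user})"
  login:
    host: "${head_node}"
EOF
  # Make the file visible only to this user so other PUNs don't pick it up.
  # (chown needs root on the real OOD host; suppressed here if not permitted.)
  chown "${user}:" "$file" 2>/dev/null || true
  chmod 0600 "$file"
  echo "$file"
}
```

You’d point this at /etc/ood/config/clusters.d on the OOD host and have your provisioning flow (or the hook itself) look up the head-node hostname per user.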
Also, you can bounce the PUN through a command. So if you have cloud-init-type stuff going on, or however this gets created, you can use the nginx_stage command to bounce a user’s PUN (i.e., without having them bounce their own PUN).
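On a typical install the command lives under /opt/ood/nginx_stage/sbin; subcommands and flags vary by OOD version, so treat this as a sketch and check `nginx_stage --help`:

```shell
# Sketch: restart a specific user's PUN as root (flags may differ by version)
sudo /opt/ood/nginx_stage/sbin/nginx_stage nginx -u jlaura -s stop
sudo /opt/ood/nginx_stage/sbin/nginx_stage pun -u jlaura
```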
The way the AWS sample project for OOD on AWS ( GitHub - aws-samples/open-on-demand-on-aws ) handles reconfiguring OOD after changes to the cluster infrastructure is by attaching an EventBridge rule to cluster creation that runs an SSM Run Command shell script on the OOD instance to update the configuration files and restart OOD.
Is it a hard requirement that the cluster be fully dynamic and brought up via a button in the OOD interface?
Soft yes (soliciting suggestions). We are exploring options where we do not need to give users access to the AWS web console / CLI. Right now we have a Service Catalog product, but launching it requires that users have access to the console. We use OOD for things like Jupyter notebooks and are exploring whether we can also use it to provision the Service Catalog product and give the user in-browser shell access.
You can stop (not terminate) PCluster head nodes, and as long as no running jobs complete while the head node is stopped, it doesn’t impact PCluster at all. You only pay for the block storage while it’s stopped, and then the button just needs to send a start-instance command instead of trying to bring up a full cluster and all of its supporting infrastructure.
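Assuming you have the head node’s instance ID handy (the ID below is a placeholder), the button only needs to wrap something like:

```shell
# Sketch: pause/resume the ParallelCluster head node (placeholder instance ID)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```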
I think we discovered the stop-but-not-terminate option accidentally one day, and we built a prototype of this functionality, but we haven’t had reason to use it just yet.