Problems trying to build from source using ansible role

I appreciate your patience on this; you’re likely the first user ever to see that page, as I just changed it last week. Sorry for all the issues.

Not a problem. I am also a n00b so I expect issues, and most of them have to do with my lack of knowledge.

And of course I have more questions. I configured my cluster with:

# /etc/ood/config/clusters.d/my_cluster.yml
---
v2:
  metadata:
    title: "My Cluster"
  login:
    host: "ec2-[redacted].us-west-2.compute.amazonaws.com"
  job:
    adapter: "slurm"
    cluster: "my_cluster"
    bin: "/opt/slurm/bin/"
    conf: "/opt/slurm/etc/slurm.conf"
    # bin_overrides:
      # sbatch: "/usr/local/bin/sbatch"
      # squeue: ""
      # scontrol: ""
      # scancel: ""

I verified that I can ssh (as ubuntu, the user I am logged in as) to the node defined there. Since this is AWS, I tried both the public and private hostnames, and both work. Incidentally, this is the same node where OOD is running. The bin and conf paths are correct. I re-ran /opt/ood/ood-portal-generator/sbin/update_ood_portal, bounced Apache, and restarted my PUN from the help menu.
But I can’t do anything related to my cluster. Jobs do not show up on the jobs page, even though I kicked off a Slurm job manually and can see it with squeue. When I try to run a shell or desktop, I get errors.

I did find this in /var/log/ondemand-nginx/ubuntu/error.log:

App 17845 output: [2022-05-11 23:21:18 +0000 ] FATAL "ActionController::InvalidAuthenticityToken (ActionController::InvalidAuthenticityToken):"

How do I fix this?

Also I now see this:

App 19034 output: [2022-05-11 23:41:27 +0000 ] ERROR "OodCore::JobAdapterError: squeue: error: Problem talking to database\nsqueue: error: 'my_cluster' can't be reached now, or it is an invalid entry for --clus

I have to figure out the name of the cluster I spun up with AWS ParallelCluster. That’s probably the problem - one of them anyway.
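One way to check the name is to read it straight out of slurm.conf. This is a minimal sketch, assuming plain Key=Value lines written with the capitalization shown; the helper name slurm_conf_value is made up for illustration, and the conf path is the one from the cluster config above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from a slurm.conf file.
# $1: path to slurm.conf, $2: key name (matched case-sensitively here,
# which is a simplification). Commented-out lines are ignored.
slurm_conf_value() {
    conf="$1"
    key="$2"
    # Strip "Key = " from matching lines and keep the last match.
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$conf" | tail -n 1
}

# Example calls (path taken from this thread):
# slurm_conf_value /opt/slurm/etc/slurm.conf ClusterName
# slurm_conf_value /opt/slurm/etc/slurm.conf AccountingStorageType
```

The second lookup may be worth a glance too, since the "Problem talking to database" part of the error suggests the cluster lookup goes through Slurm's accounting storage.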

You need to either enable SSL in Apache or follow the instructions here (if you read the develop SSL docs they say FIXME-LINK-NEEDED which would say something similar).

It’s good that you spotted where to look, though. If you look for execve in the same logs, you can see the commands being issued.
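A quick way to pull those lines out is sketched below. The function name show_issued_commands is made up for illustration; the log path in the example comment is the one quoted earlier in this thread and will differ per user.

```shell
#!/bin/sh
# Hypothetical helper: show the most recent log lines that mention execve.
# $1: path to a per-user nginx error log
# $2: how many matching lines to show (defaults to 20)
show_issued_commands() {
    log_file="$1"
    count="${2:-20}"
    # -F treats the pattern as a fixed string rather than a regex.
    grep -F 'execve' "$log_file" | tail -n "$count"
}

# Example call:
# show_issued_commands /var/log/ondemand-nginx/ubuntu/error.log
```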

OK, I will set up SSL, just being lazy. :wink:

But as for the cluster issue, it appears I need to provide a cluster name. I spun up this cluster using AWS ParallelCluster and it looks like it did not create a federated cluster. I can submit jobs from a terminal session just fine using sbatch. The cluster name is parallelcluster but when I submit a job with -M parallelcluster or --cluster parallelcluster I get an error:

sbatch: error: Problem talking to database
sbatch: error: 'parallelcluster' can't be reached now, or it is an invalid entry for --cluster.  Use 'sacctmgr list clusters' to see available clusters.

And the suggested command returns:

You are not running a supported accounting_storage plugin
Only 'accounting_storage/slurmdbd' is supported.

Not sure what’s involved in setting that up. So anyway, I get these errors manually, and also in the OOD logs. I’ll try writing some wrapper scripts that swallow the -M and --cluster options, unless you have another idea.

Thanks.
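For reference, the wrapper idea could look roughly like the sketch below. It is an illustration under assumptions, not a tested OOD recipe: the function name run_without_cluster_flag is made up, and it drops -M, --cluster and --clusters (with their value) before handing the remaining arguments to the real command.

```shell
#!/bin/sh
# Hypothetical helper: run a command with any -M/--cluster(s) option removed.
# $1: real command to run; remaining arguments: the original command line.
run_without_cluster_flag() {
    real_cmd="$1"
    shift
    skip_next=0
    first=1
    # "$@" is expanded once, before the loop body resets the argument list.
    for arg in "$@"; do
        if [ "$first" -eq 1 ]; then
            set --            # start rebuilding the argument list from empty
            first=0
        fi
        if [ "$skip_next" -eq 1 ]; then
            skip_next=0       # this argument is the value of a dropped option
            continue
        fi
        case "$arg" in
            -M|--cluster|--clusters) skip_next=1 ;;       # value is next arg
            -M*|--cluster=*|--clusters=*) ;;              # value is attached
            *) set -- "$@" "$arg" ;;                      # keep everything else
        esac
    done
    "$real_cmd" "$@"
}

# In a wrapper file, the last line would be something like:
# run_without_cluster_flag /opt/slurm/bin/sbatch "$@"
```

Each wrapper would then be listed under the bin_overrides keys that are commented out in the cluster config above. One limitation: any job-script argument that itself begins with -M would be dropped as well.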

First find out what works in the CLI. I take it -M and --cluster don’t.

When you configured Slurm, you must have set the cluster attribute. That’s why it’s passing the -M flag. Get rid of the cluster attribute in the YAML configuration and it won’t pass the flag anymore.

https://osc.github.io/ood-documentation/latest/installation/resource-manager/slurm.html
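To preview that change before touching the real file, something like this sketch prints the config with the cluster line dropped. It is deliberately naive (it removes every line whose key is cluster, wherever it sits), and the name drop_cluster_attribute is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a cluster config without its "cluster:" lines.
# $1: path to a file under /etc/ood/config/clusters.d/
# The file itself is not modified; the result goes to stdout.
drop_cluster_attribute() {
    sed '/^[[:space:]]*cluster:/d' "$1"
}

# Example call (file name from this thread):
# drop_cluster_attribute /etc/ood/config/clusters.d/my_cluster.yml
```

Compare the output with the original, and only then edit the file by hand or redirect the output over a copy.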

Thanks, but then I get an error when trying to start an interactive desktop:

But that’s ok, I think my workaround is working so far…

The issue you’ve linked there is new and specific to OOD. You’ve defined clusters in /etc/ood/config/clusters.d/. Let’s imagine you have this Slurm cluster as /etc/ood/config/clusters.d/cool_slurm_cluster.yml and you’re also using Kubernetes in AWS with /etc/ood/config/clusters.d/cooler_k8s_cluster.yml.

That app expects you to specify cool_slurm_cluster or cooler_k8s_cluster in the form.yml file.

Apps can submit to one or more of your defined clusters (heterogeneously, too: we run our apps on both Slurm and Kubernetes, and the user chooses), but you have to tell our apps which of the clusters you’ve defined in clusters.d to submit to.
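To see which names are available to put in form.yml, a sketch like this lists them. It assumes, as the example file names above suggest, that the name to use is the file name without the .yml extension; list_cluster_ids is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: list cluster names from a clusters.d directory.
# $1: directory to scan (defaults to the path used in this thread)
list_cluster_ids() {
    dir="${1:-/etc/ood/config/clusters.d}"
    for f in "$dir"/*.yml; do
        [ -e "$f" ] || continue    # the glob matched nothing
        basename "$f" .yml
    done
}

# Example call:
# list_cluster_ids
# One of the printed names then goes into the app's form.yml,
# for example as: cluster: "cool_slurm_cluster"
```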

Got it, I followed the instructions and was able to enter the cluster name in the form and submit.
