Error at the top of OOD page on new setup

We are standing up Open OnDemand for the users of our HPC cluster. After getting it stood up, there is a red banner at the top of the page that flips between two error messages:

“The cluster config for viking-cluster has a problem: (): did not find expected key while parsing a block mapping at line 2 column 3”

“The cluster config for viking-cluster has a problem: (): did not find expected key while parsing a block mapping at line 7 column 5”

I have looked at the viking-cluster.yml file and made sure it uses spaces and not tabs, and I have reloaded the web page as well as the web server, but I am still encountering the error. I have also tried with other users who have never been to the site, and in an “incognito” browser window. Line 2 column 3 is where the metadata is declared, and line 7 column 5 is where I declare which scheduler we are using (in this case Slurm).

Below is the YAML being used to init our cluster

v2:
  metadata:
    title: "HPC Cluster"
  login:
    host: "name.college.edu"
  job:
    adapter: "slurm"
    # cluster: "HPC-Cluster"
    bin: "/cm/shared/apps/slurm/current/bin/srun"
    conf: "/cm/shared/apps/slurm/var/etc/slurm/slurm.conf"
    bin_overrides:
        sbatch: "/cm/shared/apps/slurm/current/bin/sbatch"
        squeue: "/cm/shared/apps/slurm/current/bin/sbatch"
        scontrol: "/cm/shared/apps/slurm/current/bin/scontrol"
        scancel: "/cm/shared/apps/slurm/current/bin/scancel"
        copy_enviornment: false
        batch_connect:
        basic:
          script_wrapper: |
            module purge
            export SLURM_EXPORT_ENV=ALL
            %s
          set_host: "host=$(hostname -A | awk '{print $1}')"
        vnc:
          script_wrapper: |
            module purge
            export PATH="/opt/TurboVNC/bin/:$PATH"
            export WEBSOCKIFY_CMD="/usr/local/websockify/run"
            export SLURM_EXPORT_ENV=ALL
            %s
          set_host: "host=$(hostname -A | awk '{print $1}')"

I can provide a screenshot of the error message at the top of the page if needed.

We are also using Rocky Linux 9 if that helps. Thank you

Hey sorry for the troubles!

Haha, a second pair of eyes always helps! I can see a missing quote at the end of the host on line 5. Let’s start there and see if that fixes anything.

Here’s the corrected line.

  login:
    host: "name.college.edu"

Hello Travis,
Yeah, I just checked on my cluster; that was just from me sanitizing the config for this post and forgetting to include the closing ". Do you see any other problems?

The yaml all checks out as valid and I can’t see anything wrong in the config that would conflict with our docs: Slurm — Open OnDemand 3.0.3 documentation

Yeah this is strange. Those errors on those lines don’t even make sense to me looking at what is there.

I do notice the squeue is pointing to sbatch but that shouldn’t be the issue.

Since some of this is sanitized for me to see, are you able to take the actual config you have and put it through a YAML validator? That might be the best bet as maybe something in that raw file is just slightly off?

I ran it through my linter in VSCodium and I’m not getting any errors there. I even checked to make sure I was using spaces rather than tabs. Is there a log somewhere I can look at that would output these errors?

Yeah sorry I can’t spot something easy on this one.

We have pages in the docs for logging information. For this the system logs will likely be what you want:
https://osc.github.io/ood-documentation/latest/how-tos/monitoring/logging.html

I also just wanted to check, is the missing --- at the top a typo? I’m guessing so since the linter doesn’t complain on the raw but wanted to check.

Yeah, I even removed it and didn’t get a linter problem.

Thanks, I didn’t think it would matter but I’m also a bit baffled at this one.

Is there any pattern to the oscillation in errors? As in, is it reload see x then reload see y then reload see x etc. or is it pretty random?

To be sure, did you have the --- in the file and try a submit?

Yes, I did. I have tried both versions, linted each one, made the modifications on the server, and tried to submit.

Ok, thanks for doing all that and giving me the updates.

I think the next best step is to try and strip this down as far as you can, even using a whole new file just to be sure, and start very simple with what is in the yaml and build up from there to see if the error finally pops up at a specific step.

The base submit to use can just be from our docs:
https://osc.github.io/ood-documentation/latest/installation/resource-manager/slurm.html#slurm

With the correct values for your cluster and the configs. Leave out the bin_overrides for the moment.

That’s the simplest route I can think of at the moment. It’s very odd that this file is not being read, so a fresh file with a minimal build will hopefully help find the root cause.
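For reference, a minimal starting point along those lines might look like the sketch below. It reuses the sanitized values from the original post and follows the layout of the Slurm example in the docs. Two things here are assumptions rather than anything confirmed in this thread: `bin` is shown as the directory holding the Slurm client commands (the original post pointed it at the `srun` binary itself), and the file is assumed to live under /etc/ood/config/clusters.d/.

```yaml
# /etc/ood/config/clusters.d/viking-cluster.yml (assumed location)
---
v2:
  metadata:
    title: "HPC Cluster"
  login:
    host: "name.college.edu"
  job:
    adapter: "slurm"
    # Assumption: the directory that contains sbatch, squeue, scontrol, scancel
    bin: "/cm/shared/apps/slurm/current/bin"
    conf: "/cm/shared/apps/slurm/var/etc/slurm/slurm.conf"
```

Typing this into a brand new file, rather than copying the old one, also rules out any invisible characters that a linter might not flag.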

OK, so the minimal build is working without showing the errors at the top of the page. I’m going to start adding options one by one to see if I can reproduce the error, and I’ll post my findings here.
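One possible order for adding the options back is sketched below, reloading the page after each step. This is only an illustration built from the sanitized config in the original post, not a confirmed fix. Its assumptions: `copy_environment` sits under `job` and is spelled as in the docs (the original post has `copy_enviornment` nested under `bin_overrides`), `batch_connect` is a sibling of `job` under `v2`, `bin` is a directory, and the `squeue` path is a hypothetical one next to `sbatch`.

```yaml
---
v2:
  metadata:
    title: "HPC Cluster"
  login:
    host: "name.college.edu"
  job:
    adapter: "slurm"
    bin: "/cm/shared/apps/slurm/current/bin"  # assumption: directory, not srun
    conf: "/cm/shared/apps/slurm/var/etc/slurm/slurm.conf"
    # Step 1: add the overrides back, one key at a time
    bin_overrides:
      sbatch: "/cm/shared/apps/slurm/current/bin/sbatch"
      squeue: "/cm/shared/apps/slurm/current/bin/squeue"  # assumed path
      scontrol: "/cm/shared/apps/slurm/current/bin/scontrol"
      scancel: "/cm/shared/apps/slurm/current/bin/scancel"
    # Step 2: job-level option, assumed to belong here rather than in bin_overrides
    copy_environment: false
  # Step 3: interactive app settings, assumed to be a sibling of job
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        export SLURM_EXPORT_ENV=ALL
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin/:$PATH"
        export WEBSOCKIFY_CMD="/usr/local/websockify/run"
        export SLURM_EXPORT_ENV=ALL
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
```

If the banner comes back after one of these steps, that step is the place to look first.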

Ok, glad to hear the errors are cleared at least and you can iterate. Fingers crossed!