Customizing MATLAB form.yml and submit.yml.erb, simplifying default; different versions referenced

I’m trying to customize the form fields for our cluster and simplify the form. All the extra node_type and cores_lookup logic makes it difficult to know what can be removed.

The Git page has several more options vs the tutorial page.

I can’t get the versions of MATLAB to appear in the dropdown.

EDIT: I did update form.yml:

form:
  - version

and at the bottom:

  version:
    widget: select
    label: "MATLAB version"
    help: "This defines the version of MATLAB you want to load."
    options:
      - [ "R2022a", "matlab/r2022a" ]
      - [ "R2021a", "matlab/r2021a" ]
      - [ "R2019b", "matlab/r2019b" ]
      - [ "R2018b", "matlab/r2018b" ]
      - [ "R2018a", "matlab/r2018a" ]

I do see in template/script.sh.erb this section where I had to hard code the latest version.

# Launch MATLAB
<%- if gpu -%>
#module load intel/16.0.3 virtualgl
#module load matlab/2022a
module list

The suggestion here says to

Look for json files in the /etc/reporting/modules directory.

module_file_dir: "/etc/reporting/modules"

That directory does not exist. I created it and the file.

I kind of like Utah’s simplified form but ran into several errors such as:
ERROR "ERROR: NoMethodError - undefined method `[]' for nil:NilClass"

I’m not following this section:

to any ERB files in ~/ondemand/dev/bc_my_center_matlab/template/* and ~/ondemand/dev/bc_my_center_matlab/submit.yml.erb. version allows the user to select what version of MATLAB they want to run, and the second value corresponds to OSC’s module names.

Those MATLAB versions are not in the Git page, just in the tutorial.

Also, I see these warnings after a restart. Where is it looking for these files?

App 403997 output: [2024-10-28 15:28:40 -0400 ]  WARN "File /axon.json is unreadable."
App 403997 output: [2024-10-28 15:28:40 -0400 ]  WARN "File /slurmdev.json is unreadable."

And lastly this warning:

WARN "Error opening MOTD at \nException: bad URI(is not URI?): nil

@jeff.ohrstrom I’ve attached the .json file. Is there something off with it?
slurmdev.json (22.1 KB)

Also, @jeff.ohrstrom, where should module_file_dir go? In which section of the clusters.d file should it be placed, and at what indent?

I followed the suggestion here,

OOD_BC_DYNAMIC_JS=TRUE
OOD_MODULE_FILE_DIR: /share/modulesfiles

in /etc/ood/config/apps/dashboard/env, which did not exist, so I had to create it.

If you put the configuration in YAML (in the ondemand.d directory), it uses YAML syntax: a lowercase key and no OOD prefix.

module_file_dir: /some/directory

If it’s in an environment file (env), then it’s an environment variable with different syntax: an uppercase key, the OOD prefix, and = rather than a colon.

OOD_MODULE_FILE_DIR='/some/directory'

Yes this is what I have. What is the spacing/indent and in what section of the .yml file should it be in?

It’s not a YAML file in /etc/ood/config/clusters.d; it’s a YAML file in /etc/ood/config/ondemand.d.
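
As a rough illustration of the two syntaxes, here is a small Ruby sketch. The parse_env helper is a deliberately simplified stand-in written for this example (it is not how OnDemand actually reads its env file) and only accepts KEY=VALUE lines. Under that assumption, the colon-style OOD_MODULE_FILE_DIR line quoted above is not recognized as an assignment, while the same setting written as YAML parses fine.

```ruby
require "yaml"

# YAML form, as it would appear in a file under /etc/ood/config/ondemand.d.
def parse_yaml_config(text)
  YAML.safe_load(text)
end

# Simplified stand-in for an env-file reader (an assumption for illustration,
# not OnDemand's real loader): keeps only KEY=VALUE lines, strips single quotes.
def parse_env(text)
  text.each_line.with_object({}) do |line, env|
    match = line.strip.match(/\A([A-Z_][A-Z0-9_]*)=(.*)\z/)
    next unless match

    env[match[1]] = match[2].delete_prefix("'").delete_suffix("'")
  end
end

yaml_config = parse_yaml_config("module_file_dir: /some/directory\n")

# The env file as posted: the second line uses YAML's colon syntax.
posted_env = parse_env(<<~ENV)
  OOD_BC_DYNAMIC_JS=TRUE
  OOD_MODULE_FILE_DIR: /share/modulesfiles
ENV

# The same setting written as an environment variable assignment.
fixed_env = parse_env("OOD_MODULE_FILE_DIR='/some/directory'\n")

puts yaml_config.inspect
puts posted_env.inspect
puts fixed_env.inspect
```

In this sketch the posted env file yields only OOD_BC_DYNAMIC_JS. Whether the real loader skips or rejects the colon line is not something the sketch can tell you; the point is only that the env file expects KEY=VALUE.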

To simplify the form and use only one cluster, I am editing the form.yml and submit.yml.erb from Utah. How is the sbatch command built from the Node type dropdown? Choosing ‘any node’ works with both the Utah form and the OSC default, but all of the other options result in

App 867125 output: [2024-11-04 15:07:29 -0500 ] ERROR "ERROR: NoMethodError - undefined method `[]' for nil:NilClass"
App 867125 output: [2024-11-04 15:07:29 -0500 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 867125 output: [2024-11-04 15:07:29 -0500 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/bc_osc_matlab/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=200 allocations=26568 duration=27.62 view=13.71"

I would think this passes the -C option for constraint. Perhaps this error message could be a little more intuitive?

Also, clicking the link to the staged root directory breaks with:
Error occurred when attempting to access /pun/sys/dashboard/files/fs/home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/12dfb7df-ceea-4bd3-ad9d-2c462d9aa5be

Cannot read file /home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/12dfb7df-ceea-4bd3-ad9d-2c462d9aa5be

App 867125 output: [2024-11-04 15:11:08 -0500 ]  WARN "failed to determine mime type for file: /home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/12dfb7df-ceea-4bd3-ad9d-2c462d9aa5be due to error not valid mimetype: cannot open `/home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/12dfb7df-ceea-4bd3-ad9d-2c462d9aa5be' (No such file or directory)"

App 867125 output: [2024-11-04 15:11:08 -0500 ] ERROR "Cannot read file /home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/12dfb7df-ceea-4bd3-ad9d-2c462d9aa5be"

Also, in the scontrol show job output, why does the job show 2 CPUs and 14 GB of memory when only 1 CPU was selected in the form?

   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=14G,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=(null)
     Nodes=ax08 CPU_IDs=0-1 Mem=14336 GRES=

user_defined_context.json shows:

  "auto_modules_matlab": "matlab/2022a",
  "auto_accounts": "admin",
  "bc_num_hours": "1",
  "num_cores": "1",
  "node_type": "any",
  "bc_email_on_started": "0"

I would suggest that, instead of copying things directly, you start with the most basic thing that works with defaults (think of sbatch -A account and that’s it), then start to add support for additional fields like node_type.

Start with something that works well, then build on that. Otherwise you’ll run into errors like that one, where I can’t even begin to figure out what’s what because of all the edits that I can’t see.

Not sure why this could happen, but the error message is clear: the folder doesn’t exist, or at least it didn’t at the time you clicked the link. Maybe there’s some NFS propagation latency?

Don’t know off the top of my head, but it shows 1 task and 1 CPU per task. Maybe something on the Slurm side enforces a minimum of 2 cores.

Great, are there some examples of a minimal config, or do I just comment out all these form options?

It’s your submit.yml that’s giving you issues.

Start with this.

<%-
  slurm_args = []
%>
---
batch_connect:
  template: vnc
script:
  native:
  <%- slurm_args.each do |arg| %>
    - "<%= arg %>"
  <%- end %>

Even with this, no arguments from your form are being used yet. Add them to the slurm_args array one by one, like so:

slurm_args = ["--ntasks-per-node", num_cores.to_i]
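
To check what a template like this produces before submitting anything, one option is to render it outside of OnDemand. The sketch below assumes plain Ruby with the standard erb and yaml libraries. num_cores is passed in by hand as a stand-in for the form value, and render_submit and native_args are made-up helper names. The -%> closers are added only to keep the rendered output free of blank lines.

```ruby
require "erb"
require "yaml"

# The minimal submit.yml.erb from above, with one form value wired in.
SUBMIT_TEMPLATE = <<~'TEMPLATE'
  <%-
    slurm_args = ["--ntasks-per-node", num_cores.to_i]
  -%>
  ---
  batch_connect:
    template: vnc
  script:
    native:
    <%- slurm_args.each do |arg| -%>
      - "<%= arg %>"
    <%- end -%>
TEMPLATE

# Render the template with stand-in form values (form values arrive as strings).
def render_submit(form_values)
  ERB.new(SUBMIT_TEMPLATE, trim_mode: "-").result_with_hash(form_values)
end

# Parse the rendered YAML and return the list that ends up under script.native.
def native_args(form_values)
  YAML.safe_load(render_submit(form_values)).dig("script", "native")
end

puts render_submit(num_cores: "1")
puts native_args(num_cores: "1").inspect
```

Each element of the native list becomes one argument on the sbatch command line, so checking this list is a quick way to see what a form choice turns into.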

OK, so if we want a node_type for GPU, is it just a matter of passing this option: --gres=gpu:1? I see this in the tutorial:

node_type allows users to select which hardware they want to run their work on. In the options mapping the first value is displayed to the user, the second value is made available to any ERB files in ~/ondemand/dev/bc_my_center_matlab/template/* and ~/ondemand/dev/bc_my_center_matlab/submit.yml.erb. version allows the user to select what version of MATLAB they want to run, and the second value corresponds to OSC’s module names.

attributes:
  node_type:
      widget: select
      label: "Node type"
      help: |
        - **any** - (*1-28 cores*) Use any available Owens node. This reduces the
          wait time as there are no node requirements.
        - **hugemem** - (*48 cores*) Use an Owens node that has 1.5TB of
          available RAM as well as 48 cores. There are 16 of these nodes on
          Owens. Requesting hugemem nodes allocates entire nodes.
        - **vis** - (*1-28 cores*) Use an Owens node that has an [NVIDIA Tesla P100
          GPU](http://www.nvidia.com/object/tesla-p100.html) with an X server
          running in the background. This utilizes the GPU for hardware
          accelerated 3D visualization. There are 160 of these nodes on Owens.
      options:
        - [ "any",     ""            ]
        - [ "hugemem", ":hugemem"    ]
        - [ "vis",     ":vis:gpus=1" ]

What does node_type translate to as an option to sbatch? Is there a way to see the full command?

It’s however you define it or choose to use it. In the submit.yml.erb you can have any number of if/else blocks to determine what CLI arguments to use. It’s up to you to define the logic that you need.

I see these logs when I return to a MATLAB session that I did not exit (I just closed the browser tab with the ‘x’). Is this expected?

xfsettingsd: No window manager registered on screen 0.

(xfsettingsd:592660): xfsettingsd-WARNING **: 15:39:37.548: Failed to get the _NET_NUMBER_OF_DESKTOPS property.
xfsettingsd: Another instance took over. Leaving...
+ xfce4-panel --sm-client-disable

(xfce4-panel:592915): xfce4-panel-WARNING **: 15:39:39.944: Failed to connect to the D-BUS session bus: Could not connect: Connection refused

(xfce4-panel:592915): xfce4-panel-CRITICAL **: 15:39:39.944: Name org.xfce.Panel lost on the message dbus, exiting.
xfce4-panel: There is already a running instance

If it correctly ran MATLAB with windows and a main panel yea I’d say you can ignore them.

I mean if everything worked well, then yea you can ignore them.

Well I typed ‘quit’ in MATLAB and every subsequent attempt fails with:

[websockify]: started successfully (proxying 17266 ==> localhost:5901)
Scanning VNC log file for user authentications...
Generating connection YAML file...
+ xsetroot -solid '#D3D3D3'
+ xfce4-panel --sm-client-disable
+ xfsettingsd --sm-client-disable

(xfce4-panel:595241): xfce4-panel-WARNING **: 15:55:43.945: Failed to connect to the D-BUS session bus: Could not connect: Connection refused

(xfce4-panel:595241): xfce4-panel-CRITICAL **: 15:55:43.946: Name org.xfce.Panel lost on the message dbus, exiting.
xfsettingsd: Could not connect: Connection refused.

(xfsettingsd:595240): xfsettingsd-ERROR **: 15:55:43.946: Failed to connect to the dbus session bus.
xfce4-panel: There is already a running instance

Terminated
Cleaning up...
Killing Xvnc process ID 594907

Based on some other threads I removed --daemon from xfwm4 --compositor=off --daemon --sm-client-disable and set:
xfsettingsd --sm-client-disable&
in template/script.sh.erb

Where’d I go astray?

Man, any time I modify anything it breaks. What could cause a permission denied here?

Script starting...
Starting websocket server...
/var/spool/slurmd/job00288/slurm_script: line 200: /home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/afa5d8d8-65d0-4790-a8ff-a218947c896e/script.sh: Permission denied
[websockify]: pid: 595842 (proxying 43421 ==> localhost:5901)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
[websockify]: started successfully (proxying 43421 ==> localhost:5901)
Scanning VNC log file for user authentications...
Generating connection YAML file...
Cleaning up...
Killing Xvnc process ID 595825

The file exists and I have correct permissions:

ls -l /home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/afa5d8d8-65d0-4790-a8ff-a218947c896e/script.sh
-rw-r--r-- 1 myuser domain users 1069 Nov 12 16:18 /home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/afa5d8d8-65d0-4790-a8ff-a218947c896e/script.sh

Easy enough, chmod 755 on script.sh.erb
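
For anyone hitting the same thing: the ls output above shows -rw-r--r-- (mode 644), which has no execute bit, and that matches the "Permission denied" when the job tries to run script.sh. A minimal Ruby sketch of that check follows. It uses a throwaway temp file rather than the real script, and any_execute_bit? and mode_string are made-up helper names.

```ruby
require "tempfile"

# True when at least one execute bit (user, group, or other) is set.
def any_execute_bit?(path)
  (File.stat(path).mode & 0o111) != 0
end

# Permission bits as an octal string, e.g. "644" or "755".
def mode_string(path)
  format("%o", File.stat(path).mode & 0o777)
end

Tempfile.create("script.sh") do |file|
  File.chmod(0o644, file.path) # same mode as the ls output above
  puts "#{mode_string(file.path)} executable? #{any_execute_bit?(file.path)}"

  File.chmod(0o755, file.path) # equivalent of chmod 755
  puts "#{mode_string(file.path)} executable? #{any_execute_bit?(file.path)}"
end
```

Since the fix was applied to script.sh.erb in the template directory, the staged script.sh presumably picks up its mode from the template file.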

WRT the tutorial:

  node_type:
      widget: select
      label: "Node type"
      help: |
        - **any** - (*1-28 cores*) Use any available Owens node. This reduces the
          wait time as there are no node requirements.
        - **hugemem** - (*48 cores*) Use an Owens node that has 1.5TB of
          available RAM as well as 48 cores. There are 16 of these nodes on
          Owens. Requesting hugemem nodes allocates entire nodes.
        - **vis** - (*1-28 cores*) Use an Owens node that has an [NVIDIA Tesla P100
          GPU](http://www.nvidia.com/object/tesla-p100.html) with an X server
          running in the background. This utilizes the GPU for hardware
          accelerated 3D visualization. There are 160 of these nodes on Owens.
      options:
        - [ "any",     ""            ]
        - [ "hugemem", ":hugemem"    ]
        - [ "vis",     ":vis:gpus=1" ]

I just want to have one option for a GPU that simply passes --gres=gpu:1. What do I need to put into form.yml.erb and then into submit.yml.erb?

So far I put this:

<%-
  slurm_args = ["--ntasks-per-node", num_cores.to_i, "--gres"]
%>

But I’m getting:
sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification

My error log file shows:
App 1528227 output: [2024-11-14 15:24:20 -0500 ] INFO "execve = [{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/sbatch\", \"-D\", \"/home/rk3199/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/098cd650-0004-43dc-aadd-e745d08b150c\", \"-J\", \"sys/dashboard/sys/bc_osc_matlab\", \"-o\", \"/home/myuser/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/098cd650-0004-43dc-aadd-e745d08b150c/output.log\", \"-A\", \"zrc\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"--ntasks-per-node\", \"1\", \"--gres\", \"--parsable\"]"

In the example provided, what do “any”, “hugemem” and “vis” pass to the submit form?

In the submit.yml.erb your choice of node_type is there. That’s why you see if/else blocks or case statements based off of node_type.

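When the form is submitted, node_type evaluates to the user’s choice (the value, i.e. the second element, of the selected option). For example, with an option like [ "any", "any" ], picking any in the form gives node_type = 'any'; with [ "vis", "vis" ], choosing vis gives node_type = 'vis', and so on.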

The GRES error you’re getting comes from the fact that you’re passing --gres without an argument. You need to pass --gres <the GRES you want to request>, not just --gres.
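
A quick way to spot this kind of mistake is to look at the argument list itself. The Ruby sketch below uses a hypothetical helper, flags_missing_value, written only for this illustration (it is not part of OnDemand or Slurm). It reports flags that expect a value but are followed by nothing or by another flag. The first list is the tail of the execve line in the log above, where --gres is immediately followed by --parsable.

```ruby
# Hypothetical helper: list flags that need a value but are not followed by one.
def flags_missing_value(argv, flags_with_value)
  args = argv.map(&:to_s)
  args.each_with_index.select do |arg, index|
    following = args[index + 1]
    flags_with_value.include?(arg) && (following.nil? || following.start_with?("-"))
  end.map(&:first)
end

# Flags that take their value as a separate argument in these examples.
VALUE_FLAGS = ["--ntasks-per-node", "--gres"].freeze

as_submitted = ["--ntasks-per-node", "1", "--gres", "--parsable"]
two_part     = ["--ntasks-per-node", "1", "--gres", "gpu:1", "--parsable"]
one_part     = ["--ntasks-per-node", "1", "--gres=gpu:1", "--parsable"]

puts flags_missing_value(as_submitted, VALUE_FLAGS).inspect
puts flags_missing_value(two_part, VALUE_FLAGS).inspect
puts flags_missing_value(one_part, VALUE_FLAGS).inspect
```

Only the first list is flagged. Either the two-element form ("--gres", "gpu:1") or the single "--gres=gpu:1" element gives --gres the value it needs.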

Sorry I’m not following. If I put in submit.yml.erb:

<%-
  slurm_args = ["--ntasks-per-node", num_cores.to_i, "--gres=gpu:1"]
%>

Then every request gets a GPU. What do I set for a GPU request? In our case all nodes have a GPU, but there may be times users do not want/need a GPU. In form.yml.erb how would I edit this?

      options:
        - [ "any",     "any"         ]
        - [ "gpu",     "--gres=gpu:1" ]

You need some sort of logic in submit.yml.erb to toggle based on the choice of node_type.
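
As one possible shape for that logic, here is a minimal sketch rendered with plain Ruby so it can be tried outside of OnDemand. It assumes the form.yml options use "any" and "gpu" as their values (for example [ "any", "any" ] and [ "gpu", "gpu" ]) rather than putting the sbatch flag in the form. native_args_for is a made-up helper name, and num_cores and node_type are supplied by hand in place of the form.

```ruby
require "erb"
require "yaml"

# A submit.yml.erb sketch: the GPU request is added only for node_type "gpu".
GPU_TEMPLATE = <<~'TEMPLATE'
  <%-
    slurm_args = ["--ntasks-per-node", num_cores.to_i]
    # Only request a GPU when the user picked the "gpu" option.
    slurm_args << "--gres=gpu:1" if node_type == "gpu"
  -%>
  ---
  batch_connect:
    template: vnc
  script:
    native:
    <%- slurm_args.each do |arg| -%>
      - "<%= arg %>"
    <%- end -%>
TEMPLATE

# Render with stand-in form values and return the script.native list.
def native_args_for(node_type:, num_cores:)
  rendered = ERB.new(GPU_TEMPLATE, trim_mode: "-")
                .result_with_hash(node_type: node_type, num_cores: num_cores)
  YAML.safe_load(rendered).dig("script", "native")
end

puts native_args_for(node_type: "any", num_cores: "1").inspect
puts native_args_for(node_type: "gpu", num_cores: "1").inspect
```

With this layout the form only carries a short label-like value, and the mapping from that value to sbatch flags lives in one place in submit.yml.erb.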

Are there any examples of what other folks have done? I see node_type is often handled via the -C (constraint) option for Slurm. I suppose I could use all of the various GPUs, but my understanding of Slurm is that --gres=gpu:1 is needed to request a GPU resource. I wasn’t sure if -C <gpu_type> would be sufficient.

I see this example

script:
  native:
    - "-n"
    - "<%= num_cores.blank? ? 6 : 6 * num_cores.to_i %>"
    - "--gres=gpu:<%= num_cores.blank? ? 1 : num_cores.to_i %>"

Perhaps this is an option? Or what Duke has posted:

    <%- if num_gpus.to_i >0 -%>
    - "--gres"
    - "gpu:<%= num_gpus.to_i %>"
    <%- end -%>
    <%- argarr.each do |arg| %>
    - "<%= arg %>"
    <%- end %>