MATLAB sbatch: error: invalid partition specified

I’ve removed:
accounting_id: "<%= %>"

Does all this stay the same?

batch_connect:
  template: vnc
script:
  native:
  <%- slurm_args.each do |arg| %>
    - "<%= arg %>"
  <%- end %>

I see in the latest user_defined_context.json the correct project, which you previously mentioned is equivalent to Slurm’s -A or --account option:

  "version": "matlab/2020b",
  "project": "rcs",

Slightly different error:

sbatch: error: Account must be specified.
sbatch: error: Batch job submission failed: Unspecified error
App 286549 output: [2024-03-29 13:54:25 -0400 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 286549 output: [2024-03-29 13:54:25 -0400 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/dev/bc_my_center_matlab/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=200 duration=160.61 view=13.69"
[ N 2024-03-29 13:59:24.9441 255682/T3 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 286549, application /var/www/ood/apps/sys/dashboard (production)
App 288147 output: [2024-03-29 13:59:50 -0400 ]  INFO "execve = [{\"SLURM_CONF\"=>\"/cm/shared/apps/slurm/var/etc/slurm.conf\"}, \"/cm/shared/apps/slurm/current/bin/sbatch\", \"-D\", \"/moto/home/rk3199/ondemand/data/sys/dashboard/batch_connect/dev/bc_my_center_matlab/output/8230f9d3-4514-4140-a6f7-32d60a416bd5\", \"-J\", \"sys/dashboard/dev/bc_my_center_matlab\", \"-o\", \"/moto/home/rk3199/ondemand/data/sys/dashboard/batch_connect/dev/bc_my_center_matlab/output/8230f9d3-4514-4140-a6f7-32d60a416bd5/output.log\", \"-t\", \"01:00:00\", \"--export\", \"NONE\", \"--nodes\", \"1\", \"--ntasks-per-node\", \"1\", \"--parsable\", \"-M\", \"terremoto\"]"
App 288147 output: [2024-03-29 13:59:50 -0400 ] ERROR "ERROR: OodCore::JobAdapterError - sbatch: error: Account must be specified.\nsbatch: error: Batch job submission failed: Unspecified error"
App 288147 output: [2024-03-29 13:59:50 -0400 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 288147 output: [2024-03-29 13:59:50 -0400 ]  INFO "method=POST path=/pun/sys/dashboard/batch_connect/dev/bc_my_center_matlab/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=200 duration=324.14 view=125.05"

I also noticed in script.sh:
module load xalt/latest matlab/2022b

Not sure what this xalt is?

Is this the account you’re trying to use?

If you’re using a user defined name like project, then you need to set it here in the submit.yml.erb.

accounting_id: "<%= project %>"

That’s why we provide auto_accounts and bc_account fields, so that you don’t have to provide them in the submit.yml - they do this automatically. On the other hand - if you define a completely new field - then you need to use it in submit.yml.
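
For what it’s worth, here is a minimal sketch of how that could look in submit.yml.erb. It assumes the form field is named project (as in the user_defined_context.json above) and that slurm_args is still defined further up in the template, as it was before:

```yaml
---
batch_connect:
  template: vnc
script:
  # "project" is the user-defined form field; its value becomes the Slurm account.
  accounting_id: "<%= project %>"
  native:
  <%- slurm_args.each do |arg| %>
    - "<%= arg %>"
  <%- end %>
```

With project set to rcs, this should show up as -A rcs in the sbatch command that gets logged.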

xalt is something we use at OSC to track module usage. Feel free to remove what you find unnecessary.

Progress! Looks like 2 sessions were created and completed. I put the prerequisite software on 1 node to test. Do I create a field in the form to pass --nodelist to?

MATLAB (18741081)Completed | 
Created at: 2024-03-29 14:26:11 EDT

Session ID: e37bca6b-aadd-4dce-8b0a-403501e8ec14

For debugging purposes, this card will be retained for 6 more days

MATLAB (18741080)Completed | 
Created at: 2024-03-29 14:26:06 EDT

Session ID: c3d62c47-a422-4e38-bc15-32eafbce149f

For debugging purposes, this card will be retained for 6 more days

I would hard code it in the submit.yml.erb - it’s not something you’re going to keep right? If so, there’s no reason to expose the option in the form, and you’ll have only 1 place to edit when you’re finished testing.
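
A rough sketch of the hard-coded version, assuming the test node is t001 (a placeholder, substitute whichever node actually has the software) and leaving the rest of the template as it is:

```yaml
script:
  accounting_id: "<%= project %>"
  native:
    # Temporary: pin the job to the one node that has the prerequisites.
    # "t001" is a placeholder node name; delete this entry after testing.
    - "--nodelist=t001"
  <%- slurm_args.each do |arg| %>
    - "<%= arg %>"
  <%- end %>
```

When testing is done, removing that single line is the only change needed.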

Wow this is never ending:

Scanning VNC log file for user authentications...
Generating connection YAML file...
Restoring modules from user's default
Lmod has detected the following error: These module(s) exist but cannot be
loaded as requested: "matlab/2020b"
   Try: "module spider matlab/2020b" to see how to load the module(s).

No modules loaded
+ matlab -desktop
/path/to/me/ondemand/data/sys/dashboard/batch_connect/dev/bc_my_center_matlab/output/f477bea8-4110-4491-a17b-871648534f7e/script.sh: line 39: matlab: command not found
Cleaning up...
Killing Xvnc process ID 163765
+ xfwm4 --compositor=off --daemon --sm-client-disable

(xfwm4:163882): Gtk-WARNING **: 15:43:40.755: cannot open display: :1
+ xsetroot -solid '#D3D3D3'
xsetroot:  unable to open display ':1'
+ xfsettingsd --sm-client-disable
+ xfce4-panel --sm-client-disable

(xfsettingsd:163885): xfsettingsd-ERROR **: 15:43:40.785: Unable to open display.
xfce4-panel: Cannot open display: .
Type "xfce4-panel --help" for usage.

MATLAB 2020b is definitely available:

--------------------------------------------------------------------------------------------------------------------------------------------
  matlab:
--------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        matlab/2017b
        matlab/2018b
        matlab/2019b
        matlab/2020b
        matlab/2022b

I even hard-coded a module load matlab/2020b command in template/script.sh.erb:

module load slurm shared matlab/2020b
module list
set -x
vglrun matlab -desktop -nosoftwareopengl

So now it looks like it’s starting:

/output.log

Setting VNC password...

Starting VNC server...

Desktop 'TurboVNC: t001:1 (rk3199)' started on display t001:1

Log file is vnc.log

Successfully started VNC server on t001:5901...

Script starting...

Starting websocket server...

/cm/local/apps/slurm/var/spool/job18741102/slurm_script: line 190: /opt/websockify/run: No such file or directory

Scanning VNC log file for user authentications...

Generating connection YAML file...

dbus[169589]: Unable to set up transient service directory: XDG_RUNTIME_DIR "/run/user/547289" not available: No such file or directory

Restoring modules from user's default

+ xfwm4 --compositor=off --daemon --sm-client-disable

(xfwm4:169703): GLib-CRITICAL **: 16:03:18.141: g_str_has_prefix: assertion 'prefix != NULL' failed

(xfwm4:169703): xfwm4-WARNING **: 16:03:18.160: The property '/general/double_click_distance' of type int is not supported

Currently Loaded Modules:

1) shared 2) anaconda/3-2022.05 3) slurm/20.11.9 4) matlab/2022b

+ matlab -desktop

+ xsetroot -solid '#D3D3D3'

+ xfsettingsd --sm-client-disable

+ xfce4-panel --sm-client-disable

(xfsettingsd:169757): GLib-CRITICAL **: 16:03:18.489: g_str_has_prefix: assertion 'prefix != NULL' failed

(xfsettingsd:169757): GLib-GObject-CRITICAL **: 16:03:18.490: g_value_get_string: assertion 'G_VALUE_HOLDS_STRING (value)' failed

(xfsettingsd:169757): GLib-GObject-CRITICAL **: 16:03:18.492: g_value_get_string: assertion 'G_VALUE_HOLDS_STRING (value)' failed

MATLAB is selecting SOFTWARE OPENGL rendering.

However, clicking on MATLAB gets:
Failed to connect to server
and clicking on host:
Host "t001" not specified in allowlist or cluster configs.

I also added:

cat /etc/ood/config/apps/shell/env

OOD_SSHHOST_ALLOWLIST="t[001-010]:t[001-010].cm.cluster"

What logs should I look at to troubleshoot this? Here’s a ps on the compute node where the job is running:
/opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: t052:1 (rk3199) -auth /path/to/me/.Xauthority -geometry 800x600 -depth 24 -rfbauth vnc.passwd -x509cert /path/to/me/.vnc/x509_cert.pem -x509key /path/to/me/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -registrydir /usr/lib64/xorg

Here is the output.log file on the server running OOD:

Setting VNC password...
Starting VNC server...

WARNING: t052:1 is taken because of /tmp/.X11-unix/X1
Remove this file if there is no X server t052:1
Killing Xvnc process ID 411417
Xvnc process ID 411417 already killed
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X1

Desktop 'TurboVNC: t052:1 (me)' started on display t052:1

Log file is vnc.log
Successfully started VNC server on t052:5901...
Script starting...
Starting websocket server...
WebSocket server settings:
  - Listen on :59405
  - Flash security policy server
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
Restoring modules from user's default
+ xfwm4 --compositor=off --daemon --sm-client-disable

(xfwm4:426588): GLib-CRITICAL **: 10:46:14.255: g_str_has_prefix: assertion 'prefix != NULL' failed

(xfwm4:426588): xfwm4-WARNING **: 10:46:14.267: The property '/general/double_click_distance' of type int is not supported

Currently Loaded Modules:
  1) shared   2) anaconda/3-2022.05   3) slurm/20.11.9   4) matlab/2022b

 

+ matlab -desktop
+ xsetroot -solid '#D3D3D3'
+ xfsettingsd --sm-client-disable
+ xfce4-panel --sm-client-disable

(xfsettingsd:426652): GLib-CRITICAL **: 10:46:14.507: g_str_has_prefix: assertion 'prefix != NULL' failed

(xfsettingsd:426652): GLib-GObject-CRITICAL **: 10:46:14.507: g_value_get_string: assertion 'G_VALUE_HOLDS_STRING (value)' failed

(xfsettingsd:426652): GLib-GObject-CRITICAL **: 10:46:14.508: g_value_get_string: assertion 'G_VALUE_HOLDS_STRING (value)' failed
MATLAB is selecting SOFTWARE OPENGL rendering.

OK it seems to be starting up correctly.

Now I’d check your rnode_uri and node_uri configs and if you’ve enabled them. Then I would ask to check your host_regex configuration to see if the host in the URL matches the host regex you’ve supplied.

Both are commented out.

# Sub-uri used to reverse proxy to backend web server running on node that
# knows the full URI path
# Example:
#     node_uri: '/node'
# Default: null (disable this feature)
#node_uri: null

# Sub-uri used to reverse proxy to backend web server running on node that
# ONLY uses *relative* URI paths
# Example:
#     rnode_uri: '/rnode'
# Default: null (disable this feature)
#rnode_uri: null

Also commented out:
#host_regex: '[^/]+'

Still seeing:
Host "t052" not specified in allowlist or cluster configs.

Even though OOD_SSHHOST_ALLOWLIST has been set:

echo $OOD_SSHHOST_ALLOWLIST

cm.cluster:t052:t052.cm.cluster

Edit: here is the URL string after the hostname, note we are using port 4443:
:4443/pun/sys/dashboard/noVNC-1.3.0/vnc.html?autoconnect=true&path=rnode%2Ft052%2F22277%2Fwebsockify&resize=remote&password=xxx&compression=6&quality=2&commit=Launch+MATLAB

You need to enable all 3 of these. I’d also recommend that you actually set host_regex to something appropriate.

OOD_SSHHOST_ALLOWLIST only affects the shell application.

I changed to this:

node_uri: "/node"
rnode_uri: "/rnode"
host_regex: '[\w.-]+\.cm\.cluster'

Still the same “Failed to connect to server” and I restarted the httpd24-httpd.service

t052 doesn’t seem to match this regular expression.
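
As an illustration only, an ood_portal.yml along these lines would accept the short node names. It assumes every compute node is named t plus three digits, optionally followed by .cm.cluster. Going by the LocationMatch lines from the generated ood-portal.conf quoted in this thread, host_regex gets dropped inside a larger pattern, so it most likely should not carry its own ^ or $ anchors:

```yaml
# /etc/ood/config/ood_portal.yml (sketch, adjust to your site)
node_uri: "/node"
rnode_uri: "/rnode"
# Assumed naming scheme: t001 ... t999, with or without the .cm.cluster suffix.
host_regex: 't\d{3}(\.cm\.cluster)?'
```

After changing it, the conf file has to be regenerated and httpd restarted before the new locations exist.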

It seems the hostname needs to be the FQDN. You can use set_host configuration to issue a different command to find the hostname.

https://osc.github.io/ood-documentation/latest/reference/files/submit-yml/basic-bc-options.html?highlight=set_host
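
If I remember that page right, the example there looks roughly like the following. Whether hostname -A returns the name you want on your nodes is an assumption you would need to check on a compute node; the template is set to vnc here to match this app:

```yaml
batch_connect:
  template: vnc
  # Report the first fully qualified name instead of the short hostname.
  set_host: "host=$(hostname -A | awk '{print $1}')"
```

The idea is that the host in the connection URL then becomes something like t052.cm.cluster, which a host_regex ending in \.cm\.cluster would match.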

According to https://regex101.com/ t052.cm.cluster does match.

No, all the compute nodes just have a t followed by a 3-digit number.

No matter what URL variation I try I get a 404:

# Not Found

The requested URL /node/t052.cm.cluster/5432 was not found on this server.

I can tell from this URL that t052 is the host you’re trying to connect to, not the FQDN t052.cm.cluster. You can see that the path query parameter here is rnode/t052/22277/websockify, where t052 is the host.

I’d ask if you bounced httpd for the locations to take effect. You can do a spot check for these locations in the ood-portal.conf file in httpd’s configuration directory.

I changed the regex to:
host_regex: '^([a-zA-Z]\d{3}|\d[a-zA-Z]\d{2}|\d{2}[a-zA-Z]\d|\d{3}[a-zA-Z])$'

That matches ‘t052’ however still 404 errors:

154.27.26.155 - - [02/Apr/2024:12:12:50 -0400] "GET /node/t052/5432 HTTP/1.1" 404 212 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

A few times, yes.

Do you mean these?

node_uri: "/node"
rnode_uri: "/rnode"

Could this be an issue since we use 4443?

#Default: null (use default port 80 or 443 if SSL enabled)
port: 443

No, I mean look in the httpd conf file, /etc/httpd/conf.d/ood-portal.conf, for the Location /rnode and /node.

Looks like ours is in:
/opt/rh/httpd24/root/etc/httpd/conf.d/ood-portal.conf

So all this has to change? I take it /etc/ood/config/ood_portal.yml does not override ood-portal.conf?

 # Reverse proxy traffic to backend PUNs through Unix domain sockets:
  #
  #     https://ouruni.edu:443/pun/dev/app/simulations/1
  #     #=> unix:/path/to/socket|http://localhost/pun/dev/app/simulations/1
  #
  SetEnv OOD_PUN_URI "/pun"
  <Location "/pun">
    AuthType Basic
    AuthName "Ourserver"
    AuthBasicProvider socache external
    AuthExternal pwauth
    AuthExternalProvideCache On
    AuthnCacheProvideFor external
    AuthnCacheContext server
    AuthnCacheTimeout 300
    RequestHeader unset Authorization
    Require valid-user

    ProxyPassReverse "http://localhost/pun"

    # ProxyPassReverseCookieDomain implementation (strip domain)
    Header edit* Set-Cookie ";\s*(?i)Domain[^;]*" ""

    # ProxyPassReverseCookiePath implementation (less restrictive)
    Header edit* Set-Cookie ";\s*(?i)Path\s*=(?-i)(?!\s*/pun)[^;]*" "; Path=/pun"

    SetEnv OOD_PUN_SOCKET_ROOT "/var/run/ondemand-nginx"
    SetEnv OOD_PUN_MAX_RETRIES "5"
    LuaHookFixups pun_proxy.lua pun_proxy_handler

  </Location>

It does, or at least the things in the YML file are what’s used to populate and create the CONF file. It does this through the /opt/ood/ood-portal-generator/sbin/update_ood_portal script.

What version of OOD are you running? Bouncing httpd should be enough for, I think, versions 2.0 and higher. At some point we added a step to the systemd unit file for httpd to run update_ood_portal when it bounces. Maybe you’re running a version where you need to run update_ood_portal manually. Or maybe there’s an error in update_ood_portal and it’s not writing new files. You can check the systemd logs (journalctl) for httpd for that output.

ondemand-3.0.3-1.el7.x86_64

On that note the upgrade instructions for 3.1 have some broken dependencies for RHEL 7. I read that 3.1 would be the last for OOD?

yum update ondemand
Loaded plugins: fastestmirror, priorities, product-id, search-disabled-repos, subscription-manager
Loading mirror speeds from cached hostfile
epel/x86_64/metalink                                                                                                     |  19 kB  00:00:00     
 * cm-rhel7-8.1-updates: updates-us-east.brightcomputing.com
 * epel: mirror.pilotfiber.com
Globus-Connect-Server-5-Stable                                                                                           | 3.0 kB  00:00:00     
cm-rhel7-8.1-updates                                                                                                     | 1.5 kB  00:00:00     
https://yum.osc.edu/ondemand/3.1/web/el7/x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
Trying other mirror.

Looks like el7 was removed (or not added) in Feb 2024?
https://yum.osc.edu/ondemand/3.1/web/

Yes, I ran that. The file is still out of sync, so it creates a new version. Could there be some missing options? I’d say so, based on this diff of /opt/rh/httpd24/root/etc/httpd/conf.d/ood-portal.conf.new:

<   # Reverse proxy traffic to backend webserver through IP sockets:
<   #
<   #     https://ourserver.edu:443/node/HOST/PORT/index.html
<   #     #=> http://HOST:PORT/node/HOST/PORT/index.html
<   #
<   <LocationMatch "^/node/(?<host>[\w.-]+\.cm\.cluster)/(?<port>\d+)">
<     AuthType Basic
<     AuthName "OurServer"
<     AuthBasicProvider socache external
<     AuthExternal pwauth
<     AuthExternalProvideCache On
<     AuthnCacheProvideFor external
<     AuthnCacheContext server
<     AuthnCacheTimeout 300
<     RequestHeader unset Authorization
<     Require valid-user
< 
<     # ProxyPassReverse implementation
<     Header edit Location "^[^/]+//[^/]+" ""
144,182d111
<     # ProxyPassReverseCookieDomain implemenation
<     Header edit* Set-Cookie ";\s*(?i)Domain[^;]*" ""
< 
<     # ProxyPassReverseCookiePath implementation
<     Header edit* Set-Cookie ";\s*(?i)Path[^;]*" ""
<     Header edit  Set-Cookie "^([^;]+)" "$1; Path=/node/%{MATCH_HOST}e/%{MATCH_PORT}e"
< 
<     LuaHookFixups node_proxy.lua node_proxy_handler
<   </LocationMatch>
< 
<   # Reverse "relative" proxy traffic to backend webserver through IP sockets:
<   #
<   #     https://ourserver.edu:443/rnode/HOST/PORT/index.html
<   #     #=> http://HOST:PORT/index.html
<   #
<   <LocationMatch "^/rnode/(?<host>[\w.-]+\.cm\.cluster)/(?<port>\d+)(?<uri>/.*|)">
<     AuthType Basic
<     AuthName "Terremoto"
<     AuthBasicProvider socache external
<     AuthExternal pwauth
<     AuthExternalProvideCache On
<     AuthnCacheProvideFor external
<     AuthnCacheContext server
<     AuthnCacheTimeout 300
<     RequestHeader unset Authorization
<     Require valid-user
< 
<     # ProxyPassReverse implementation
<     Header edit Location "^([^/]+//[^/]+)|(?=/)|^([\./]{1,}(?<!/))" "/rnode/%{MATCH_HOST}e/%{MATCH_PORT}e"
< 
<     # ProxyPassReverseCookieDomain implemenation
<     Header edit* Set-Cookie ";\s*(?i)Domain[^;]*" ""
< 
<     # ProxyPassReverseCookiePath implementation
<     Header edit* Set-Cookie ";\s*(?i)Path[^;]*" ""
<     Header edit  Set-Cookie "^([^;]+)" "$1; Path=/rnode/%{MATCH_HOST}e/%{MATCH_PORT}e"
< 
<     LuaHookFixups node_proxy.lua node_proxy_handler
<   </LocationMatch>
202,203d130
<     ProxyPreserveHost On
<     ProxyAddHeaders On

Did you edit the conf file yourself? The program won’t replace it if it’s been edited by hand outside of the program. Try running /opt/ood/ood-portal-generator/sbin/update_ood_portal with the -f option to force an update. But beyond that, don’t edit the conf file yourself. Put all the options you want in the YAML file and let the program create the conf file.

3.0 is the last version supported on EL 7 systems. To get 3.1 you’d need to upgrade your OS.