MATLAB on OOD 3.1: App has not been initialized or does not exist

Fresh install, trying to configure the first app, MATLAB. I ran git clone https://github.com/OSC/bc_osc_matlab.git inside /var/www/ood/apps/sys/

I changed the cluster name in form.yml within /var/www/ood/apps/sys/bc_osc_matlab

Visiting ourdomain/pun/sys/bc_osc_matlab shows this:

App has not been initialized or does not exist
This is the first time this app has been launched in your per-user NGINX (PUN) server. This requires a configuration change followed by a restart of your PUN server. Be sure you save all the work you are doing in other apps that have active websocket connections (i.e., Shell App) and you complete all file uploads/downloads.

Clicking the "Initialize App" button will apply the configuration change and restart your per-user NGINX (PUN) server.

Clicking that fails with the following:

[ N 2024-10-11 16:54:50.2468 1062530/T8 age/Cor/CoreMain.cpp:670 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)
[ N 2024-10-11 16:54:50.2469 1062530/T1 age/Cor/CoreMain.cpp:1245 ]: Received command to shutdown gracefully. Waiting until all clients have disconnected...
[ N 2024-10-11 16:54:50.2471 1062530/Ta Ser/Server.h:901 ]: [ServerThr.2] Freed 0 spare client objects
[ N 2024-10-11 16:54:50.2471 1062530/Ta Ser/Server.h:558 ]: [ServerThr.2] Shutdown finished
[ N 2024-10-11 16:54:50.2471 1062530/Tc Ser/Server.h:901 ]: [ServerThr.3] Freed 0 spare client objects
[ N 2024-10-11 16:54:50.2472 1062530/Tc Ser/Server.h:558 ]: [ServerThr.3] Shutdown finished
[ N 2024-10-11 16:54:50.2472 1062530/Te Ser/Server.h:901 ]: [ServerThr.4] Freed 0 spare client objects
[ N 2024-10-11 16:54:50.2472 1062530/Te Ser/Server.h:558 ]: [ServerThr.4] Shutdown finished
[ N 2024-10-11 16:54:50.2473 1062530/T8 Ser/Server.h:901 ]: [ServerThr.1] Freed 0 spare client objects
[ N 2024-10-11 16:54:50.2476 1062530/Tg Ser/Server.h:901 ]: [ApiServer] Freed 0 spare client objects
[ N 2024-10-11 16:54:50.2477 1062530/Tg Ser/Server.h:558 ]: [ApiServer] Shutdown finished
[ N 2024-10-11 16:54:50.2478 1062530/T8 Ser/Server.h:558 ]: [ServerThr.1] Shutdown finished
[ N 2024-10-11 16:54:50.3361 1062683/T1 age/Wat/WatchdogMain.cpp:1377 ]: Starting Passenger watchdog...
[ N 2024-10-11 16:54:50.4239 1062689/T1 age/Cor/CoreMain.cpp:1340 ]: Starting Passenger core...
[ N 2024-10-11 16:54:50.4244 1062689/T1 age/Cor/CoreMain.cpp:256 ]: Passenger core running in multi-application mode.
[ N 2024-10-11 16:54:50.4718 1062689/T1 age/Cor/CoreMain.cpp:1015 ]: Passenger core online, PID 1062689
2024/10/11 16:54:50 [error] 1062700#0: *1 open() "/var/www/ood/apps/sys/bc_osc_matlab/public" failed (2: No such file or directory), client: unix:, server: localhost, request: "GET /pun/sys/bc_osc_matlab HTTP/1.1", host: "openondemand.its.zi.columbia.edu", referrer: "https://openondemand.its.zi.columbia.edu/pun/sys/bc_osc_matlab"
[ E 2024-10-11 16:54:55.2532 1062530/T1 age/Cor/TelemetryCollector.h:454 ]: Error contacting anonymous telemetry server: Connection timed out after 5001 milliseconds
[ N 2024-10-11 16:54:55.3056 1062530/T1 age/Cor/CoreMain.cpp:1325 ]: Passenger core shutdown finished

[ N 2024-10-11 16:57:08.7240 1062689/T9 age/Cor/CoreMain.cpp:670 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)
[ N 2024-10-11 16:57:08.7240 1062689/T1 age/Cor/CoreMain.cpp:1245 ]: Received command to shutdown gracefully. Waiting until all clients have disconnected...
[ N 2024-10-11 16:57:08.7242 1062689/T9 Ser/Server.h:901 ]: [ServerThr.1] Freed 0 spare client objects
[ N 2024-10-11 16:57:08.7242 1062689/T9 Ser/Server.h:558 ]: [ServerThr.1] Shutdown finished
[ N 2024-10-11 16:57:08.7242 1062689/Td Ser/Server.h:901 ]: [ServerThr.3] Freed 0 spare client objects
[ N 2024-10-11 16:57:08.7242 1062689/Ta Ser/Server.h:901 ]: [ServerThr.2] Freed 0 spare client objects
[ N 2024-10-11 16:57:08.7242 1062689/Te Ser/Server.h:901 ]: [ServerThr.4] Freed 0 spare client objects
[ N 2024-10-11 16:57:08.7242 1062689/Ta Ser/Server.h:558 ]: [ServerThr.2] Shutdown finished
[ N 2024-10-11 16:57:08.7242 1062689/Td Ser/Server.h:558 ]: [ServerThr.3] Shutdown finished
[ N 2024-10-11 16:57:08.7244 1062689/Tg Ser/Server.h:901 ]: [ApiServer] Freed 0 spare client objects
[ N 2024-10-11 16:57:08.7245 1062689/Tg Ser/Server.h:558 ]: [ApiServer] Shutdown finished
[ N 2024-10-11 16:57:08.7242 1062689/Te Ser/Server.h:558 ]: [ServerThr.4] Shutdown finished
[ N 2024-10-11 16:57:08.8157 1063484/T1 age/Wat/WatchdogMain.cpp:1377 ]: Starting Passenger watchdog...
[ N 2024-10-11 16:57:08.9026 1063488/T1 age/Cor/CoreMain.cpp:1340 ]: Starting Passenger core...
[ N 2024-10-11 16:57:08.9030 1063488/T1 age/Cor/CoreMain.cpp:256 ]: Passenger core running in multi-application mode.
[ N 2024-10-11 16:57:08.9500 1063488/T1 age/Cor/CoreMain.cpp:1015 ]: Passenger core online, PID 1063488
2024/10/11 16:57:08 [error] 1063508#0: *1 open() "/var/www/ood/apps/sys/bc_osc_matlab/public" failed (2: No such file or directory), client: unix:, server: localhost, request: "GET /pun/sys/bc_osc_matlab HTTP/1.1", host: "openondemand.its.zi.columbia.edu", referrer: "https://openondemand.its.zi.columbia.edu/pun/sys/bc_osc_matlab"
[ E 2024-10-11 16:57:13.7278 1062689/T1 age/Cor/TelemetryCollector.h:454 ]: Error contacting anonymous telemetry server: Connection timed out after 5000 milliseconds
[ N 2024-10-11 16:57:13.7803 1062689/T1 age/Cor/CoreMain.cpp:1325 ]: Passenger core shutdown finished

Are there some prerequisites I missed?

That’s not the right URL. Should be this:

/pun/sys/dashboard/batch_connect/sys/bc_osc_matlab/session_contexts/new

OK now I get:
This app requires clusters that do not exist or you do not have access to

App 1065702 output: [2024-10-11 17:30:34 -0400 ]  INFO "execve = [\"git\", \"describe\", \"--always\", \"--tags\"]"
App 1065702 output: [2024-10-11 17:30:34 -0400 ]  INFO "method=GET path=/pun/sys/dashboard/batch_connect/sys/bc_osc_matlab/session_contexts/new format=html controller=BatchConnect::SessionContextsController action=new status=200 allocations=43112 duration=142.33 view=16.93"

[Fri Oct 11 18:16:51.281516 2024] [lua:info] [pid 1064390:tid 1064499] [client 10.192.151.234:57178] req_port="443" req_is_https="true" req_hostname="ourdomain.edu" req_accept="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" req_handler="proxy-server" res_content_location="" req_accept_encoding="gzip, deflate, br" res_content_disp="" local_user="rk3199" res_location="" allowed_hosts="ourdomain.edu" res_content_language="" req_is_websocket="false" req_accept_language="en-us,en;q=0.9" req_content_type="" remote_user="rk3199" req_server_name="ourdomain.edu" req_referer="" log_time="2024-10-11T22:16:51.281258.0Z" req_uri="/pun/sys/dashboard/batch_connect/sys/bc_osc_matlab/session_contexts/new" res_content_encoding="" req_filename="proxy:http://localhost/pun/sys/dashboard/batch_connect/sys/bc_osc_matlab/session_contexts/new" req_user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15" req_protocol="HTTP/1.1" res_content_length="" res_content_type="text/html; charset=utf-8" time_user_map="0.005" time_proxy="3354.549" req_status="200" req_origin="" req_user_ip="10.192.x.x" req_accept_charset="" req_cache_control="" log_hook="ood" req_method="GET"

Looks like bc_desktop was not configured correctly to match the cluster name. Now I’m getting:
You are not a member of the xxx group. Please email xxx@ourdomain.edu to request access to MATLAB.

I changed this in submit.yml.erb:
raise(StandardError, err_msg) unless CurrentUser.group_names.include?('xxx')
Is there another place where that config option needs updating? I know I am a member of the ‘xxx’ group (pardon the obfuscation).

Also from the XFCE desktop launcher I’m getting this:
slurm_script: line 23: vncserver: command not found

For testing we are just using one node that has all the prerequisites. Where can I force the --nodelist option for Slurm?

In the submit.yml.erb where you see all the other --ntasks-per-node and other options defined.

And yeah, since you pulled our MATLAB app, it’s going to have things in it that are specific to OSC, so you’ll have to change a bit about it.
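As a rough illustration of where --nodelist would go: the sketch below builds the argument list the same way the slurm_args lines in submit.yml.erb do and renders it into a script.native list. It assumes the app passes slurm_args through script.native; render_submit is a made-up helper and "node001" is a placeholder hostname, so neither comes from bc_osc_matlab itself.

```ruby
require "erb"
require "yaml"

# Hypothetical helper: renders a submit.yml-style fragment from a list of
# Slurm arguments, roughly the way an ERB template would.
def render_submit(slurm_args)
  template = <<~TEMPLATE
    ---
    script:
      native:
    <%- slurm_args.each do |arg| -%>
        - "<%= arg %>"
    <%- end -%>
  TEMPLATE
  ERB.new(template, trim_mode: "-").result(binding)
end

nodes = 1  # hypothetical form values
ppn = 4

# "--nodelist" is appended next to the other Slurm options.
slurm_args = ["--nodes", nodes.to_s, "--ntasks-per-node", ppn.to_s,
              "--nodelist", "node001"]

puts render_submit(slurm_args)
```

Rendering the fragment and loading it back with YAML.safe_load is a quick way to confirm the template still produces valid YAML after an edit.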

Still getting the error about not being a member of the correct group. Let’s say the group I’m in is ‘xxx’. I change it here:

unless CurrentUser.group_names.include?('xxx')

And if I want to specify a GPU (in our case all nodes have GPUs, so I’ll need to do more editing of the files), then I have:

      "any" => {"ourcluster" => "40", "xxx" => "28"},
      "gpu" => {"ourcluster" => "48", "xxx" => "28"},

Can you explain how ‘partition’ is being set here?

  when "hugemem"
    partition = bc_num_slots.to_i > 1 ? "hugemem-parallel" : "hugemem"
    slurm_args = [ "--nodes", "#{nodes}", "--ntasks-per-node", "#{ppn}", "--partition", partition ]
  when "gpu"

Also, is the spelling of ‘constraint’ here intentional, or is it a typo?


  when "any40-core"
    slurm_args = [ "--nodes", "#{nodes}", "--ntasks-per-node", "#{ppn}", "--contstraint", "48core" ]
  when "any48-core"
    slurm_args = [ "--nodes", "#{nodes}", "--ntasks-per-node", "#{ppn}", "--contstraint", "48core" ]

What would cause this error when clicking on the staged root directory?

Error occurred when attempting to access /pun/sys/dashboard/files/fs//home/me/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/a7478807-eacb-427e-99aa-c0b8f87bb084
Cannot read file /home/me/ondemand/data/sys/dashboard/batch_connect/sys/bc_osc_matlab/output/a7478807-eacb-427e-99aa-c0b8f87bb084

Remove the whole group check then. That’s an OSC mechanism that you don’t need.
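If the group-membership check is kept instead, one way to see why it fails is to print the group names the Ruby process actually sees. This is only a sketch using Ruby's standard Etc module; it does not call OOD's CurrentUser.group_names and may not resolve groups in exactly the same way. The method names are invented for illustration. A process keeps the group list it started with, so a PUN started before a membership change may still show the old list.

```ruby
require "etc"

# Names of the primary and supplementary groups of the current process.
# Falls back to the numeric gid when no group entry can be found.
def visible_group_names
  gids = ([Process.gid] + Process.groups).uniq
  gids.map do |gid|
    begin
      entry = Etc.getgrgid(gid)
      entry ? entry.name : gid.to_s
    rescue ArgumentError
      gid.to_s
    end
  end
end

# Hypothetical stand-in for the include? test in submit.yml.erb.
def member_of?(group)
  visible_group_names.include?(group)
end

puts visible_group_names.sort.join(", ")
puts "member of 'xxx': #{member_of?('xxx')}"
```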

If the user chooses more than 1 node we put them in the parallel queue. Otherwise it’s the normal queue. It’s a ternary expression, like an if/else block.
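To make the ternary concrete, here is a minimal sketch of the same expression next to its if/else form. The method names are made up for illustration; only the expression itself comes from the "hugemem" branch quoted above.

```ruby
# condition ? value_if_true : value_if_false
def hugemem_partition(bc_num_slots)
  bc_num_slots.to_i > 1 ? "hugemem-parallel" : "hugemem"
end

# The same logic written as an explicit if/else block.
def hugemem_partition_verbose(bc_num_slots)
  if bc_num_slots.to_i > 1
    "hugemem-parallel"
  else
    "hugemem"
  end
end

# Form values typically arrive as strings, hence the to_i.
puts hugemem_partition("1")
puts hugemem_partition("4")
```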

Not intentional; likely a bug.

OK looks like bc_desktop is still not configured correctly. I keep seeing:
/var/spool/slurmd/job00072/slurm_script: line 23: vncserver: command not found

On our test node it’s installed in:
/opt/TurboVNC/bin/vncserver

Looks like I need to update $PATH. What file is most appropriate for this?

edit: I added this in /etc/profile.d, now getting:

Script starting...
Starting websocket server...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'xfce'...
[websockify]: pid: 133375 (proxying 57790 ==> localhost:5901)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
[websockify]: failed to launch!
Cleaning up...
Killing Xvnc process ID 133343
Desktop 'xfce' ended with 1 status...

/var/spool/slurmd/job00073/slurm_script: line 206: /opt/websockify/run: No such file or directory

You want to set it globally for the entire cluster.

https://osc.github.io/ood-documentation/latest/reference/files/submit-yml-erb.html#setting-batch-connect-options-globally

Something like this. Note that you’re appending to the current configuration; I don’t have the full clusters.d definition here.

v2:
  batch_connect:
    vnc:
      script_wrapper: |
        export PATH="/opt/TurboVNC/bin:$PATH"
        %s
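For anyone wondering about the %s line: it marks where the generated job script is substituted into the wrapper. The sketch below shows the idea with plain Ruby string formatting. It is a conceptual illustration, not OOD's actual implementation, and wrap_script and GENERATED_SCRIPT are hypothetical names.

```ruby
# Hypothetical stand-in for the script OOD generates for the session.
GENERATED_SCRIPT = "echo 'Starting VNC server...'"

# Wrapper text as in the YAML above; %s marks where the script body goes.
SCRIPT_WRAPPER = <<~'WRAPPER'
  export PATH="/opt/TurboVNC/bin:$PATH"
  %s
WRAPPER

# Substitute the script body into the wrapper at the %s placeholder.
def wrap_script(wrapper, body)
  format(wrapper, body)
end

puts wrap_script(SCRIPT_WRAPPER, GENERATED_SCRIPT)
```

Anything placed before %s in the wrapper runs ahead of the session script, which is why the PATH export goes there.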

Progress:

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: ournode.ouruni.edu:1 (myuser)' started on display ournode.ouruni.edu:1

Log file is vnc.log
Successfully started VNC server on ournode.ouruni.edu:5901...
Script starting...
Starting websocket server...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'xfce'...
[websockify]: pid: 133600 (proxying 31470 ==> localhost:5901)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
[websockify]: failed to launch!
Cleaning up...
Killing Xvnc process ID 133561
Desktop 'xfce' ended with 1 status...

and:

slurm_script: line 206: /opt/websockify/run: No such file or directory

websockify is in:
/usr/bin/websockify

How do I find why it failed?

Launching desktop 'xfce'...
[websockify]: pid: 134463 (proxying 26879 ==> localhost:5901)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
[websockify]: failed to launch!
Cleaning up...
Killing Xvnc process ID 134415
Desktop 'xfce' ended with 1 status...

All I see is:
/opt/websockify/run: No such file or directory

In the /etc/ood/config/clusters.d file:

     batch_connect:
       basic:
         script_wrapper: |
           module purge
           %s
       vnc:
         script_wrapper: |
           module purge
           export PATH="/opt/TurboVNC/bin:$PATH"
           export WEBSOCKIFY_CMD="/usr/bin/websockify"

Edit:
now seeing this:

Launching desktop 'xfce'...
[websockify]: pid: 134684 (proxying 15959 ==> localhost:5901)
[websockify]: log file: ./websockify.log
[websockify]: waiting ...
[websockify]: failed to launch!
Cleaning up...
_IceTransmkdir: ERROR: euid != 0,directory /tmp/.ICE-unix will not be created.
/usr/bin/iceauth:  creating new authority file /run/user/1822857372/ICEauthority
Killing Xvnc process ID 134638
Desktop 'xfce' ended with 1 status...

(xfwm4:134718): Gtk-WARNING **: 11:17:24.402: cannot open display: :1

edit:
The websockify error is gone but I still see this:

Script starting...
Generating connection YAML file...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'xfce'...
xfce4-session: Cannot open display: .
Type 'xfce4-session --help' for usage.
Desktop 'xfce' ended with 1 status...
Cleaning up...

It’s not the /tmp issue mentioned in this other thread.

Do you have websockify installed on your compute nodes?

I sure do.

Now when I first log in I see this error:

This app does not supply a sub app form file under the directory '/var/www/ood/apps/sys/bc_desktop/local'

The only logs are:


Script starting...
Generating connection YAML file...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'xfce'...
xfce4-session: Cannot open display: .
Type 'xfce4-session --help' for usage.
Desktop 'xfce' ended with 1 status...
Cleaning up...

So it’s now just stuck, unable to open the display. You might be missing some packages; that’s always a little tricky to debug. Unfortunately, I have not worked with Ubuntu in any HPC environments, so I’m not familiar with how best to install xfce4 and its dependencies.

Oh this is RHEL 9, which packages might be missing? Or perhaps a pathing issue?

Somehow the $DISPLAY environment variable is empty? When VNC starts up it should be populating this.

It’s trying to pull the variable out of the VNC output. You should see something like this in job_script_content.sh for that job:

display=$(echo "${VNC_OUT}" | awk -F':' '/^Desktop/{print $NF}')
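For reference, that awk expression takes the line starting with "Desktop" and keeps whatever follows the last colon. Below is a rough Ruby equivalent, using the TurboVNC line quoted earlier in this thread as sample input; display_from_vnc_output is a made-up name for illustration.

```ruby
# Returns the text after the last ':' on the first line that starts with
# "Desktop", or nil when no such line exists in the VNC output.
def display_from_vnc_output(vnc_out)
  line = vnc_out.each_line.find { |l| l.start_with?("Desktop") }
  return nil if line.nil?

  line.chomp.split(":", -1).last
end

SAMPLE_VNC_OUT = <<~OUT
  Setting VNC password...
  Starting VNC server...

  Desktop 'TurboVNC: ournode.ouruni.edu:1 (myuser)' started on display ournode.ouruni.edu:1

  Log file is vnc.log
OUT

puts display_from_vnc_output(SAMPLE_VNC_OUT).inspect
puts display_from_vnc_output("").inspect
```

If the VNC output never contains a "Desktop ..." line, nothing is extracted, which would be consistent with the desktop being unable to open a display.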

Oh gotcha, not sure why I was thinking Ubuntu (probably from some other post!). Well, usually it’s as simple as dnf groupinstall "Xfce" "base-x" (assuming you have EPEL installed and enabled). I seem to recall, though, that this might not pull in everything. I am guessing that’s how you installed Xfce in the first place?

I’ll just sneak in and mention that a native web interface for MATLAB also exists, and it can be set up without X and without VNC rendering. It’s really very straightforward to make an app for it (as easy as Jupyter), using the official matlab-proxy package. I have an OOD example app using that here:


Ah, I was missing the “base-x” packages. Is that mentioned somewhere in the docs?

However now I’m always getting this error:

This app does not supply a sub app form file under the directory '/var/www/ood/apps/sys/bc_desktop/local'

Where is this coming from? @jeff.ohrstrom any idea where I can search?

How are you navigating there? There should be a link in the navigation bar for you to use. I think you can run into this if the URL is slightly mis-constructed or if the app is irregular somehow.