noVNC: Failed to connect to server

Hi all, I recently got an installation of OOD running on my HPC cluster and added a desktop app by cloning the “bc_desktop” repo.

The desktop app launches successfully, but I get a “Failed to connect to server” error on the noVNC page. The output.log is below:

Setting VNC password...
Starting VNC server...

WARNING: n002.cluster.com:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server n002.cluster.com:1
Killing Xvnc process ID 158023
Xvnc process ID 158023 already killed
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X1
Xvnc did not appear to shut down cleanly. Removing /tmp/.X1-lock

Desktop 'TurboVNC: n002.cluster.com:1 (faizanbadami)' started on display n002.cluster.com:1

Log file is vnc.log
Successfully started VNC server on n002.cluster.com:5901...
Script starting...
Starting websocket server...
/var/spool/slurmd/job31383/slurm_script: line 193: /usr/bin/websockify: No such file or directory
cmdTrace.c(713):ERROR:104: 'restore' is an unrecognized subcommand
cmdModule.c(411):ERROR:104: 'restore' is an unrecognized subcommand
Scanning VNC log file for user authentications...
Generating connection YAML file...
Launching desktop 'xfce'...
dbus[178581]: Unable to set up transient service directory: XDG_RUNTIME_DIR "/run/user/1001" not available: No such file or directory
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall

(xfce4-session:178584): xfce4-session-WARNING **: 15:49:25.681: xfsm_manager_load_session: Something wrong with /home/faizanbadami/.cache/sessions/xfce4-session-n002.cluster.com:1, Does it exist? Permissions issue?

(xfwm4:178591): xfwm4-WARNING **: 15:49:25.792: Error opening /dev/dri/card0: No such file or directory
SELinux Troubleshooter: Applet requires SELinux be enabled to run.
vmware-user: could not open /proc/fs/vmblock/dev
/usr/share/system-config-printer/applet.py:44: PyGIWarning: Notify was imported without specifying a version first. Use gi.require_version('Notify', '0.7') before import to ensure that the right version gets loaded.
  from gi.repository import Notify
system-config-printer-applet: failed to start NewPrinterNotification service
system-config-printer-applet: failed to start PrinterDriversInstaller service: org.freedesktop.DBus.Error.AccessDenied: Connection ":1.109822" is not allowed to own the service "com.redhat.PrinterDriversInstaller" due to security policies in the configuration file

I tried the solutions recommended here, here and here but wasn’t able to solve the issue.

Please see attached my websocket settings and my ood_portal.yml.


ood_portal.yml (13.2 KB)

Two errors stick out to me. First, the job script can’t find websockify at /usr/bin/websockify. Second, your XDG_RUNTIME_DIR doesn’t seem to be valid.

/var/spool/slurmd/job31383/slurm_script: line 193: /usr/bin/websockify: No such file or directory
dbus[178581]: Unable to set up transient service directory: XDG_RUNTIME_DIR "/run/user/1001" not available: No such file or directory

You can set websockify_cmd globally in the clusters.d file to point to the correct location.
https://osc.github.io/ood-documentation/latest/reference/files/submit-yml/vnc-bc-options.html
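
A minimal sketch of that global setting, assuming websockify is installed at /usr/local/bin/websockify on the compute nodes. The path is hypothetical; use whatever `which websockify` reports on a compute node.

```yaml
# /etc/ood/config/clusters.d/<your_cluster>.yml (only the relevant keys shown)
v2:
  batch_connect:
    vnc:
      # Hypothetical path; replace with the real location on the compute nodes.
      websockify_cmd: "/usr/local/bin/websockify"
```

Setting it here applies to every VNC-based app submitted to that cluster, so it does not have to be repeated per app.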

As for the second issue, XDG_RUNTIME_DIR: we have to do the same thing, setting it to a temporary directory for the job.

Thank you for the quick response. I was able to correct the websockify error. For the XDG error:

dbus[76297]: Unable to set up transient service directory: XDG_RUNTIME_DIR "/run/user/1001" not available: No such file or directory

I added the recommended export command to submit.yml.erb under /etc/ood/config/apps/bc_desktop/submit and added submit: submit/submit.yml.erb to my hpc_altoneuro.yml under /etc/ood/config/apps/bc_desktop/.

This causes a different error and the job never submits:

#<LoadError: Could not load 'vnc export XDG_RUNTIME_DIR="$TMPDIR/xdg_runtime"'. Make sure that that batch connect template in the configuration file is valid.>

Can you share the yml file? There seems to be a syntax error with your YAML.

Here you go.
hpc_altoneuro.yml (343 Bytes)
submit.yml (84 Bytes)

You’re missing the before_script key; it should look like this:

---
batch_connect:
  template: vnc
  before_script: |
    export XDG_RUNTIME_DIR="$TMPDIR/xdg_runtime"

Also note that you can set these options globally by cluster so they’ll work for all apps for that given cluster.

https://osc.github.io/ood-documentation/latest/reference/files/submit-yml-erb.html#setting-batch-connect-options-globally

Got it and will set it globally after testing.

So I had to manually create a /tmp/xdg_runtime directory for the error to go away (after making the change you recommended, of course). Not sure if that’s the right thing to do here?

Both the websockify and XDG errors no longer show up in the output.log, but I still cannot connect to the desktop.
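
One way to avoid creating the directory by hand is to let before_script create it, sketched here under the assumption that before_script runs as ordinary bash on the compute node before the desktop starts. The `${TMPDIR:-/tmp}` fallback and the `${USER}` suffix are assumptions added for illustration and were not part of the suggestion above.

```yaml
---
batch_connect:
  template: vnc
  before_script: |
    # Assumption: fall back to /tmp when the scheduler does not set TMPDIR.
    export XDG_RUNTIME_DIR="${TMPDIR:-/tmp}/xdg_runtime_${USER}"
    # Create the directory if it is missing and keep it private to the user.
    mkdir -p "$XDG_RUNTIME_DIR"
    chmod 700 "$XDG_RUNTIME_DIR"
```

A per-user directory with mode 700 avoids several users sharing one /tmp/xdg_runtime on the same node.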

Setting VNC password...
Starting VNC server...

WARNING: n001.cluster.com:1 is taken because of /tmp/.X1-lock
Remove this file if there is no X server n001.cluster.com:1
Killing Xvnc process ID 10270
Xvnc process ID 10270 already killed
Xvnc did not appear to shut down cleanly. Removing /tmp/.X11-unix/X1
Xvnc did not appear to shut down cleanly. Removing /tmp/.X1-lock

Desktop 'TurboVNC: n001.cluster.com:1 (faizanbadami)' started on display n001.cluster.com:1

Log file is vnc.log
Successfully started VNC server on n001.cluster.com:5901...
Script starting...
Starting websocket server...
cmdTrace.c(713):ERROR:104: 'restore' is an unrecognized subcommand
cmdModule.c(411):ERROR:104: 'restore' is an unrecognized subcommand
Launching desktop 'xfce'...
WebSocket server settings:
  - Listen on :54400
  - Flash security policy server
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall
generating cookie with syscall

(xfce4-session:13590): xfce4-session-WARNING **: 11:59:14.617: xfsm_manager_load_session: Something wrong with /home/faizanbadami/.cache/sessions/xfce4-session-n001.cluster.com:1, Does it exist? Permissions issue?

(xfwm4:13597): xfwm4-WARNING **: 11:59:14.700: Error opening /dev/dri/card0: No such file or directory
SELinux Troubleshooter: Applet requires SELinux be enabled to run.
vmware-user: could not open /proc/fs/vmblock/dev
/usr/share/system-config-printer/applet.py:44: PyGIWarning: Notify was imported without specifying a version first. Use gi.require_version('Notify', '0.7') before import to ensure that the right version gets loaded.
  from gi.repository import Notify
system-config-printer-applet: failed to start NewPrinterNotification service
system-config-printer-applet: failed to start PrinterDriversInstaller service: org.freedesktop.DBus.Error.AccessDenied: Connection ":1.110814" is not allowed to own the service "com.redhat.PrinterDriversInstaller" due to security policies in the configuration file

Looks like it’s booting OK.

What’s the experience you’re seeing on the client side? I mean - what’s the error you’re seeing in your browser?

The same “Failed to connect to server” message (screenshot attached).

You do not have SSL on your Apache server?

I think that could be it. I believe the VNC library here requires a secure connection. Do you have any error messages in your browser’s console? (Open developer tools with F12 and navigate to the Console tab.)

No SSL yet. Whenever I’ve tried to add the SSL cert and key to my ood_portal.yml, I have not been able to get httpd to restart. In the ood_portal.yml attached above, lines 32, 33 and 34, once uncommented, keep httpd from restarting. The cert and key are from where the domain is registered. Any advice on how to get HTTPS going?
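
For reference, a sketch of the shape the ssl key in ood_portal.yml is expected to take: a list of Apache directives, each written as a quoted string. All file paths below are placeholders, not values from this thread.

```yaml
# ood_portal.yml (only the ssl key shown; all paths are placeholders)
ssl:
  - 'SSLCertificateFile "/etc/pki/tls/certs/ondemand.example.com.crt"'
  - 'SSLCertificateKeyFile "/etc/pki/tls/private/ondemand.example.com.key"'
  # Only needed if the certificate authority supplies an intermediate chain.
  - 'SSLCertificateChainFile "/etc/pki/tls/certs/ondemand.example.com-interm.crt"'
```

If httpd refuses to restart after a change like this, `apachectl configtest` or the httpd error log usually names the offending directive or an unreadable file.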

For console errors see screenshot:

I think I figured out what I was doing wrong with regard to SSL. SSL is now working and OOD is running on HTTPS.

The error, however, remains the same.

Hi @jeff.ohrstrom any further thoughts on this?

Looks like you need to enable reverse proxying back to the compute nodes. Here’s the documentation for that:

https://osc.github.io/ood-documentation/latest/app-development/interactive/setup/enable-reverse-proxy.html

and

https://osc.github.io/ood-documentation/latest/reference/files/ood-portal-yml.html#ood-portal-generator-configuration-configure-reverse-proxy

Maybe I am making some mistake here.

The head node running OOD is:

[clusterhn clusters.d]# hostname
clusterhn.cluster.com

And all the compute nodes are (n00[1-6]):

[n001 ~]# hostname
n001.cluster.com

My cluster.yml file has the following settings:

v2:
  metadata:
    title: "hpc_altoneuro"
    #url: "https://www.osc.edu/supercomputing/computing/owens"
    #hidden: false
  login:
    host: "clusterhn.cluster.com"

And the ood_portal has the following:

servername: ondemandhpcaltoneuro.me
host_regex: '[\w.-]+\.cluster\.com'
node_uri: '/node'
rnode_uri: 'rnode'

Does anything stand out? The vnc and websockify processes get started on the compute nodes like they should:

# ps -aux |grep vnc
/opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: n003.cluster.com:1 (faizanbadami) -auth /home/faizanbadami/.Xauthority -geometry 800x600 -depth 24 -rfbwait 120000 -rfbauth vnc.passwd -x509cert /home/faizanbadami/.vnc/x509_cert.pem -x509key /home/faizanbadami/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg -idletimeout 0

# ps -aux |grep websockify
/usr/bin/python /bin/websockify -D 33564 localhost:5901

Your rnode_uri here needs to be /rnode (it’s missing the leading forward slash).
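
With that change, the reverse proxy lines of ood_portal.yml would read as follows. The other three values are copied unchanged from the post above.

```yaml
servername: ondemandhpcaltoneuro.me
host_regex: '[\w.-]+\.cluster\.com'
node_uri: '/node'
# Leading slash added; previously 'rnode'.
rnode_uri: '/rnode'
```

Edits to ood_portal.yml normally take effect only after the Apache configuration is regenerated with update_ood_portal and httpd is restarted.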

never a dull moment…

But the leading slash seems to have solved the issue :sweat_smile:
