Open OnDemand - Unable to start bc_desktop session (Failed to connect to server)

Hi all,

We are in the process of configuring OOD for a customer who would like to use the “bc_desktop” app to schedule a job through Slurm and request an Xfce4 session.

No matter what I try, when I request a new session it is submitted to Slurm and the job starts on the node soon after.

Looking at the “output.log” and “vnc.log” files, nothing seems to indicate a problem, but after various attempts all that happens is that noVNC says “Failed to connect to the server”.

Client Details:
Operating System: RHEL 8.6
Packages (groups) installed: Xfce and base-x
Display Manager: gdm
Websockify Version: python3-websockify-0.11.0-1.el8.noarch
TurboVNC: turbovnc-3.0.91-20230818.x86_64

Server Details:
Operating System: RHEL 8.6
OOD Version(s):

  • ondemand-passenger-6.0.14-1.ood3.0.0.el8.x86_64
  • ondemand-nodejs-3.0.0-1.el8.x86_64
  • ondemand-runtime-3.0.0-1.el8.x86_64
  • ondemand-gems-3.0.3-1.el8.x86_64
  • ondemand-nginx-1.20.2-1.p6.0.14.ood3.0.0.el8.x86_64
  • ondemand-apache-3.0.0-1.el8.x86_64
  • ondemand-gems-3.0.3-1-3.0.3-1.el8.x86_64
  • ondemand-ruby-3.0.0-1.el8.x86_64
  • ondemand-dex-2.36.0-1.el8.x86_64
  • ondemand-3.0.3-1.el8.x86_64

cluster.d/Ada.yml file:

---
v2:
  metadata:
    title: "<customer_redacted>"
  login:
    host: "hpcvis02"
  job:
    adapter: "slurm"
    bin: "/opt/slurm/23.02.4/bin"
    conf: "/opt/slurm/23.02.4/etc/slurm.conf"
    cluster: "<customer_redacted>-ada"
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
    vnc:
      script_wrapper: |
        module --force purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        export WEBSOCKIFY_CMD="/usr/bin/websockify -v"
        %s

/etc/ood/config/apps/bc_desktop/ada_vnc.yml

---
title: "Ada Desktop"
cluster: "Ada"
form:
  - bc_vnc_idle
  - desktop
  - bc_num_slots
  - bc_num_hours
  - node_type
  - bc_queue
  - bc_vnc_resolution
attributes:
  bc_num_hours:
    min: 1
    max: 24
    step: 1
  desktop:
    widget: select
    label: "Desktop Environment"
    options:
      - xfce
  bc_queue:
    widget: select
    label: "Slurm Partition"
    options:
      - defq
  bc_vnc_resolution:
    required: true
  node_type: null
  bc_vnc_idle: 0

Below is also the “output.log” from a node; it doesn’t matter which node it is taken from, they all look the same:

Setting VNC password...
Starting VNC server...

Desktop 'TurboVNC: comp007.int.ada.<customer_redacted>.ac.uk:1 (uiams13)' started on display comp007.int.ada.<customer_redacted>.ac.uk:1

Log file is vnc.log
Successfully started VNC server on comp007.int.ada.<customer_redacted>.ac.uk:5901...
Script starting...
Starting websocket server...
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

Launching desktop 'xfce'...
WebSocket server settings:
  - Listen on :46290
  - No SSL/TLS support (no cert file)
  - Backgrounding (daemon)
Scanning VNC log file for user authentications...
Generating connection YAML file...
Geolocation service not in use

** (xfce4-screensaver:42872): WARNING **: 18:07:53.455: screensaver already running in this session

** (xfdesktop:42859): WARNING **: 18:07:53.460: Failed to set the background '/usr/share/backgrounds/images/default.png': GDBus.Error:org.freedesktop.DBus.Error.InvalidArgs: No such interface 'org.freedesktop.DisplayManager.AccountsService'

Also the “vnc.log”:

TurboVNC Server (Xvnc) 64-bit v3.0.91 (build 20230818)
Copyright (C) 1999-2023 The VirtualGL Project and many others (see README.md)
Visit http://www.TurboVNC.org for more information on TurboVNC

14/11/2023 18:07:50 Using security configuration file /etc/turbovncserver-security.conf
14/11/2023 18:07:50 Enabled security type 'tlsvnc'
14/11/2023 18:07:50 Enabled security type 'tlsotp'
14/11/2023 18:07:50 Enabled security type 'tlsplain'
14/11/2023 18:07:50 Enabled security type 'x509vnc'
14/11/2023 18:07:50 Enabled security type 'x509otp'
14/11/2023 18:07:50 Enabled security type 'x509plain'
14/11/2023 18:07:50 Enabled security type 'vnc'
14/11/2023 18:07:50 Enabled security type 'otp'
14/11/2023 18:07:50 Enabled security type 'unixlogin'
14/11/2023 18:07:50 Enabled security type 'plain'
14/11/2023 18:07:50 Desktop name 'TurboVNC: comp007.int.ada.<customer_redacted>.ac.uk:1 (uiams13)' (comp007.int.ada.<customer_redacted>.ac.uk:1)
14/11/2023 18:07:50 Protocol versions supported: 3.3, 3.7, 3.8, 3.7t, 3.8t
14/11/2023 18:07:50 Listening for VNC connections on TCP port 5901
14/11/2023 18:07:50   Interface 0.0.0.0
14/11/2023 18:07:50 Framebuffer: BGRX 8/8/8/8
14/11/2023 18:07:50 New desktop size: 800 x 600
14/11/2023 18:07:50 New screen layout:
14/11/2023 18:07:50   0x00000040 (output 0x00000040): 800x600+0+0
14/11/2023 18:07:50 Maximum clipboard transfer size: 1048576 bytes
14/11/2023 18:07:50 VNC extension running!

httpd error log

[Tue Nov 14 18:34:33.528407 2023] [suexec:notice] [pid 1934:tid 139625199626560] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Tue Nov 14 18:34:33.560898 2023] [so:warn] [pid 1934:tid 139625199626560] AH01574: module authnz_external_module is already loaded, skipping
AH00558: httpd: Could not reliably determine the server's fully qualified domain name, using hpcondemand01.int.ada.<customer_redacted>.ac.uk. Set the 'ServerName' directive globally to suppress this message
[Tue Nov 14 18:34:33.563278 2023] [ssl:warn] [pid 1934:tid 139625199626560] AH01873: Init: Session Cache is not configured [hint: SSLSessionCache]
[Tue Nov 14 18:34:33.565739 2023] [lbmethod_heartbeat:notice] [pid 1934:tid 139625199626560] AH02282: No slotmem from mod_heartmonitor
[Tue Nov 14 18:34:33.576943 2023] [mpm_event:notice] [pid 1934:tid 139625199626560] AH00489: Apache/2.4.37 (Red Hat Enterprise Linux) OpenSSL/1.1.1k configured -- resuming normal operations
[Tue Nov 14 18:34:33.576965 2023] [core:notice] [pid 1934:tid 139625199626560] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'

Below is the console output in Firefox when trying to connect to the noVNC session:

Firefox can’t establish a connection to the server at wss://hpcondemand01.ada.<customer_redacted>.ac.uk/rnode/comp007.int.ada.<customer_redacted>.ac.uk/46290/websockify.

Nothing here seems to indicate there’s actually anything wrong. I can see “websockify” is running when a session is started.

I’ve spent far too long trying to make this work and any help at all would be appreciated!

Hi and welcome! Thanks for all the information. Those logs look good, so I’m wondering if there’s a config mismatch.

I would ask if you’ve enabled interactive apps through ood_portal.yml. Specifically these 2 configs need to be set to enable this proxying.

rnode_uri: '/rnode'
node_uri: '/node'

If they are enabled, then I would check host_regex in the same file. You’re redacting the customer name (which is fine, you should!), but spot check that regular expression to see if it does indeed match the compute nodes you’re trying to proxy to.

Failing all of those checks, I’d wonder about wss transport and whether you’re behind a firewall, load balancer or similar. A lot of the time we see folks enable HTTP(S) traffic on load balancers, but not WS(S).
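As a rough way to rule out basic reachability, something like the Python sketch below could be run on the OnDemand web node. It only checks that a plain TCP connection to the websockify port on the compute node succeeds; it says nothing about whether a load balancer in front of Apache passes the WebSocket upgrade. The function name `can_connect` is made up for illustration, and the host and port would come from the session's output.log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Illustrative helper, not part of Open OnDemand.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

For the session above, that would be a call along the lines of `can_connect("comp007.int.ada.<customer_redacted>.ac.uk", 46290)` with the real domain filled in. A result of `False` points at the network or at websockify; `True` suggests looking at the proxy configuration instead.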

Hi!

Thanks for the speedy response. Yes I have redacted the customer, however everything looks to be consistent across all node naming when looking through the various logs.

There is no firewall in place, iptables on the compute nodes and ondemand server are both “allow all” for input/output/forward.

I’ll be honest it looks like I may have missed this step during the installation.

I’ve now added this to the ood_portal.yml file. Could you please confirm this is actually correct? In this environment the nodes are called “comp001 - comp063”:

host_regex: '(comp)\d+'
node_uri: '/node'
rnode_uri: '/rnode'

Unfortunately this made no difference at all. Here are the Firefox console messages:

>> RFB.constructor rfb.js:182:13
>> Display.constructor display.js:26:13
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0 display.js:56:13
<< Display.constructor display.js:58:13
New state 'connecting', was ''. rfb.js:780:13
>> RFB.connect rfb.js:468:13
connecting to wss://hpcondemand01.ada.<customer_redacted>.ac.uk:443/rnode/comp007.int.ada.<customer_redacted>.ac.uk/32681/websockify rfb.js:471:17
<< RFB.connect rfb.js:522:13
<< RFB.constructor rfb.js:250:13
Firefox can’t establish a connection to the server at wss://hpcondemand01.ada.<customer_redacted>.ac.uk/rnode/comp007.int.ada.<customer_redacted>.ac.uk/32681/websockify. websock.js:231:20
>> WebSock.onerror: [object Event] websock.js:267:17
WebSocket on-error event rfb.js:607:13
<< WebSock.onerror: [object Event] websock.js:269:17
>> WebSock.onclose websock.js:261:17
WebSocket on-close event rfb.js:570:13
Failed when connecting: Connection closed (code: 1006) rfb.js:831:21
New state 'disconnecting', was 'connecting'. rfb.js:780:13
>> RFB.disconnect rfb.js:526:13
>> Keyboard.allKeysUp keyboard.js:240:13
<< Keyboard.allKeysUp keyboard.js:244:13
<< RFB.disconnect rfb.js:555:13
New state 'disconnected', was 'disconnecting'. rfb.js:780:13
Clearing disconnect timer rfb.js:783:17
<< WebSock.onclose websock.js:263:17
>> Keyboard.allKeysUp keyboard.js:240:13
<< Keyboard.allKeysUp keyboard.js:244:13
The resource at “https://hpcondemand01.ada.<customer_redacted>.ac.uk/pun/sys/dashboard/noVNC-1.3.0/app/images/info.svg” preloaded with link preload was not used within a few seconds. Make sure all attributes of the preload tag are set correctly. vnc.html
The resource at “https://hpcondemand01.ada.<customer_redacted>.ac.uk/pun/sys/dashboard/noVNC-1.3.0/app/images/error.svg” preloaded with link preload was not used within a few seconds. Make sure all attributes of the preload tag are set correctly. vnc.html
The resource at “https://hpcondemand01.ada.<customer_redacted>.ac.uk/pun/sys/dashboard/noVNC-1.3.0/app/images/warning.svg” preloaded with link preload was not used within a few seconds. Make sure all attributes of the preload tag are set correctly. vnc.html

The main thing here is this URL does look like what I would expect:

wss://hpcondemand01.ada.<customer_redacted>.ac.uk:443/rnode/comp007.int.ada.

I don’t think a host_regex of '(comp)\d+' is enough. Maybe something more like:

host_regex: 'comp\d+\.int\.ada\.<customer_redacted>\.ac\.uk'

Honestly - it’s a security thing (you don’t want to allow proxying back to malicious sites) so maybe just the domain is enough.

host_regex: '[\w.-]+\.<customer_redacted>\.ac\.uk'

The host part of the URL is the FQDN (that’s what this regex is applied to), not just the first part.

Also - be sure to bounce httpd after you update ood_portal.yml for the settings to take effect!
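To make the FQDN point concrete, here is a small Python sketch. It assumes the proxy rule has roughly the shape `^/rnode/(host)/(port)(uri)`, so the host regex has to cover the whole path segment between `/rnode/` and the port. The helper `proxy_allows` is hypothetical, and `example` stands in for the redacted customer name.

```python
import re


def proxy_allows(host_regex: str, path: str, rnode_uri: str = "/rnode") -> bool:
    """Check a request path against an assumed /rnode/<host>/<port>/... rule.

    Illustrative only: the real rule lives in the generated Apache config.
    """
    pattern = re.compile(
        rf"^{re.escape(rnode_uri)}/(?P<host>{host_regex})/(?P<port>\d+)(?P<uri>/.*|)"
    )
    return pattern.match(path) is not None


# Path as seen in the browser console, with a placeholder domain.
path = "/rnode/comp007.int.ada.example.ac.uk/46290/websockify"

# Short-name regex: matches "comp007" but then expects "/", finds "." instead.
print(proxy_allows(r"(comp)\d+", path))
# FQDN regex: covers the whole host segment.
print(proxy_allows(r"comp\d+\.int\.ada\.example\.ac\.uk", path))
# Domain-wide regex: also covers the whole host segment.
print(proxy_allows(r"[\w.-]+\.example\.ac\.uk", path))
```

Under that assumption the first check prints False and the other two print True, which lines up with the short-name regex being rejected while the FQDN patterns get through.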

You my friend… are a genius!

As soon as I set the following:

host_regex: 'comp\d+\.int\.ada\.<customer_redacted>\.ac\.uk'
node_uri: '/node'
rnode_uri: '/rnode'

It all sprang into life! Thank you so much for the assistance.

Thanks! Not that smart, just seen that issue before. Just hop on here if you have any other issues/questions/whatever and we’ll try to answer what we can.
