Getting "Error: no HOME environment variable" when starting VNC server

When trying to set up virtual desktops, we keep getting this error in the output log for the virtual desktop:

/usr/bin/id: cannot find name for user ID 2614687
/usr/bin/id: cannot find name for group ID 2000513
/usr/bin/id: cannot find name for user ID 2614687
Setting VNC password...
Error: no HOME environment variable
Starting VNC server...
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
vncserver: The HOME environment variable is not set.
Cleaning up...
vncserver: The HOME environment variable is not set.

I haven't touched the form.yml file, but I have set up a directory called submit for submission. Here is the YAML for it:

---
batch_connect:
  before_script: |
    # Export the module function if it exists
    [[ $(type -t module) == "function"  ]] && export -f module

And here is the desktop YAML:

# /etc/ood/config/apps/bc_desktop/viking_login_desktop.yml
---
title: "Viking Desktop"
cluster: "viking-cluster"
submit: submit/viking_submint.yml.erb

Any help you could provide would be greatly appreciated. I've been banging my head on this for the past week.

I’d add something like this to set up the environment. If you run a scheduler like Slurm, it doesn’t set up a bash environment for the job, so you may need to force it by sourcing /etc/profile or similar.

---
batch_connect:
  before_script: |
    # source /etc/profile to setup the environment.
    . /etc/profile

    # Export the module function if it exists
    [[ $(type -t module) == "function"  ]] && export -f module

First off, good guess: we are running Slurm. Second, I added that to the submit file and restarted httpd, and I'm still getting the same error. I saw in a previous post about this problem that you may have to clear the PUNs, but I have no idea how to do that.

It’s in the Help menu → ‘Restart Web Server’ will restart your PUN.

OK, I cleared the PUNs and I'm still getting the errors. Is there logging somewhere else I can turn up to find out what I'm doing wrong? Also, do I need to have a shell script that enters the VNC session into the Slurm scheduler?

Not really - this is a shell script that’s running on the cluster - nothing more special than sbatch my_script.sh.

To debug a bit more though you could take an old job (find the session directory) and start to edit those scripts.

The top-level script that gets submitted to the scheduler is job_script_content.sh. You can try adding set -x in there, or similar echo or env statements.

So you can edit and resubmit job_script_content.sh with --export NONE to simulate how we submit the job.
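
As a rough sketch of that kind of debugging, a preamble like the following could be pasted near the top of a copy of job_script_content.sh before resubmitting it. The function name debug_env is made up for illustration and is not part of Open OnDemand; it only prints what the job actually sees at that point.

```shell
#!/bin/sh
# Hypothetical debugging preamble (not part of Open OnDemand).
# Prints the pieces of the environment the VNC template depends on.
debug_env() {
  echo "--- job environment ---"
  # id -un fails when the node cannot map the numeric UID to a name
  name=$(id -un 2>/dev/null) || name="<no name for uid $(id -u)>"
  echo "user: $name"
  echo "HOME: ${HOME:-<unset>}"
  echo "PATH: ${PATH:-<unset>}"
  if command -v module >/dev/null 2>&1; then
    echo "module: available"
  else
    echo "module: not found"
  fi
}

debug_env
# Uncomment to trace every command the job script runs after this point.
# set -x
```

If HOME shows up as <unset> here, the problem is already present before the VNC template's own before_script runs, which narrows things down to the node or scheduler environment rather than the desktop app.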

Maybe you need to source /etc/profile in script_wrapper instead, so that it is sourced before the before script is even run?

---
batch_connect:
  script_wrapper: |
    # source /etc/profile to setup the environment.
    . /etc/profile
    
    %s

OK, so after getting some much-needed sleep, I think I see the problem. We have set home to be on an NFS mount so that when users launch a job they have all of their Slurm scripts with them no matter which node gets selected. At the top of the error I see that it can't find my UID and GID, and because of that it cannot set my home dir. Is there a way for me to set a custom home?

No. Is your $HOME on the web server not the same as the $HOME on the compute nodes?

Home is mounted over NFS on each compute/GPU node as well as on the head node.

So what I think might be happening is that when the job gets dispatched to a node, it tries to read the node's /etc/profile and cannot. Would it make sense to add a sethome.sh in /etc/profile.d to have it load home that way?
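
For anyone hitting the same "cannot find name for user ID" lines, a minimal check like the one below, run on the compute node itself (for example inside a small test job), may help separate a name-service problem from an NFS problem. It assumes getent is installed on the node; uid_resolves is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the node's passwd database
# (local files, LDAP, SSSD, ...) can map the numeric UID to an account.
uid_resolves() {
  getent passwd "$1" >/dev/null 2>&1
}

# Check the UID given as the first argument, or the current one.
uid="${1:-$(id -u)}"
if uid_resolves "$uid"; then
  # Field 6 of a passwd entry is the home directory
  echo "uid $uid -> home $(getent passwd "$uid" | cut -d: -f6)"
else
  echo "uid $uid is unknown on $(uname -n); HOME cannot be derived from it"
fi
```

If the UID is unknown on the node, mounting the home directory there does not help by itself, because nothing on the node can tell which directory belongs to that user.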

Hello, so I've made some progress, but now I'm getting a Slurm error when trying to launch the virtual desktop:

/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 3: module: command not found
/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 134: hostname: command not found
/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 134: awk: command not found
Setting VNC password...
/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 125: head: command not found
/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 125: head: command not found
Error: no HOME environment variable
Starting VNC server...
/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 150: seq: command not found
Cleaning up...
/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 30: awk: command not found
/usr/bin/env: 'perl': No such file or directory
/cm/local/apps/slurm/var/spool/job415849/slurm_script: line 34: pkill: command not found

Do I need to have my Slurm admin add a module to load VNC?

No, but it looks like you need to fix your PATH. OOD uses --export NONE when it submits jobs, but sometimes we need to fix that like what we do on our production desktops below.
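
To see how far off the PATH is, something along these lines could be dropped into the wrapper temporarily. It is only a sketch: missing_cmds is a hypothetical helper, and the command list is simply taken from the "command not found" lines in the log above.

```shell
#!/bin/sh
# Hypothetical helper: print the commands from the argument list
# that cannot be found on the current PATH, separated by spaces.
missing_cmds() {
  missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  # Strip the leading space left by the concatenation above
  echo "${missing# }"
}

echo "PATH=${PATH:-<unset>}"
echo "missing: $(missing_cmds hostname awk head seq perl pkill)"
```

If basic tools such as awk or head are reported as missing, the job is most likely starting with an empty or unusual PATH. Which directories need to be added back is site-specific.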

OK, so here is my submit file:

---
batch_connect:
  before_script: |
    # reset SLURM_EXPORT_ENV so that things like srun & sbatch work out of the box
    export SLURM_EXPORT_ENV=ALL

     %s

Before, I had the following in there:

batch_connect:
  script_wrapper: |
    # source /etc/profile to setup the environment.
    . /etc/profile/

    %s

Do I need to incorporate the two?

Yeah, I would combine them into:

batch_connect:
  script_wrapper: |
    # source /etc/profile to set up the environment.
    . /etc/profile

    # reset SLURM_EXPORT_ENV so that things like srun & sbatch work out of the box
    export SLURM_EXPORT_ENV=ALL

    %s

OK, I've tried that and I'm still getting the same error.

Can you tell me if there is anything I need to change in my cluster.yml?

v2:
  metadata:
    title: "Cluster"
  login:
    host: "server.host.edu"
  job:
    adapter: "slurm"
    bin: "/cm/shared/apps/slurm/current/bin"
    conf: "/cm/shared/apps/slurm/var/etc/slurm/slurm.conf"
    # bin_overrides:
      # sbatch: "/usr/local/bin/sbatch"
      # squeue: ""
      # scontrol: ""
      # scancel: ""
    copy_enviornment: false
  batch_connect:
    basic:
      script_wrapper: |
        module purge
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"
    vnc:
      script_wrapper: |
        module purge
        export PATH="/opt/TurboVNC/bin:$PATH"
        #source /etc/profile
        #source ~/.bash_profile
        #export PATH="/usr/local/turbovnc/bin:/opt/TurboVNC/bin"
        export WEBSOCKIFY_CMD="/usr/local/websockify"
        export SLURM_EXPORT_ENV=ALL
        %s
      set_host: "host=$(hostname -A | awk '{print $1}')"

I've also tried setting copy environment to true, but that doesn't help either.

So we ended up having to modify the vnc.rb files in order to get websockify to start so that it can then connect to VNC. Here is what we changed:

require "ood_core/refinements/hash_extensions"

module OodCore
  module BatchConnect
    class Factory
      using Refinements::HashExtensions

      # Build the VNC template from a configuration
      # @param config [#to_h] the configuration for the batch connect template
      def self.build_vnc(config)
        context = config.to_h.symbolize_keys.reject { |k, _| k == :template }
        Templates::VNC.new(context)
      end
    end

    module Templates
      # A batch connect template that starts up a VNC server within a batch job
      class VNC < Template
        # @param context [#to_h] the context used to render the template
        # @option context [#to_sym, Array<#to_sym>] :conn_params ([]) A list of
        #   connection parameters added to the connection file (`:host`,
        #   `:port`, `:password`, `:spassword`, `:display` and `:websocket`
        #   will always exist)
        # @option context [#to_s] :websockify_cmd
        #   ("${WEBSOCKIFY_CMD:-/opt/websockify/run}") the path to the
        #   websockify script (assumes you don't modify `:after_script`)
        # @option context [#to_s] :vnc_log ("vnc.log") path to vnc server log
        #   file (assumes you don't modify `:before_script` or `:after_script`)
        # @option context [#to_s] :vnc_passwd ("vnc.passwd") path to the file
        #   generated that contains the encrypted vnc password (assumes you
        #   don't modify `:before_script`)
        # @option context [#to_s] :vnc_args arguments used when starting up the
        #   vnc server (overrides any specific vnc argument) (assumes you don't
        #   modify `:before_script`)
        # @option context [#to_s] :name ("") name of the vnc server session
        #   (not set if blank or `:vnc_args` is set) (assumes you don't modify
        #   `:before_script`)
        # @option context [#to_s] :geometry ("") resolution of vnc display (not
        #   set if blank or `:vnc_args` is set) (assumes you don't modify
        #   `:before_script`)
        # @option context [#to_s] :dpi ("") dpi of vnc display (not set if
        #   blank or `:vnc_args` is set) (assumes you don't modify
        #   `:before_script`)
        # @option context [#to_s] :fonts ("") comma delimited list of fonts
        #   available in vnc display (not set if blank or `:vnc_args` is set)
        #   (assumes you don't modify `:before_script`)
        # @option context [#to_s] :idle ("") timeout vnc server if no
        #   connection in this amount of time in seconds (not set if blank or
        #   `:vnc_args` is set) (assumes you don't modify `:before_script`)
        # @option context [#to_s] :extra_args ("") any extra arguments used
        #   when initializing the vnc server process (not set if blank or
        #   `:vnc_args` is set) (assumes you don't modify `:before_script`)
        # @option context [#to_s] :vnc_clean ("...") script used to clean up
        #   any active vnc sessions (assumes you don't modify `:before_script`
        #   or `:clean_script`)
        # @see Template
        def initialize(context = {})
          super
        end

        private
          # We need to know the VNC and websockify connection information
          def conn_params
            (super + [:display, :websocket, :spassword]).uniq
          end

          # Before running the main script, start up a VNC server and record
          # the connection information
          def before_script
            <<-EOT.gsub(/^ {14}/, "")
              # Setup one-time use passwords and initialize the VNC password
              function change_passwd () {
                echo "Setting VNC password..."
                password=$(create_passwd "#{password_size}")
                spassword=${spassword:-$(create_passwd "#{password_size}")}
                (
                  umask 077
                  echo -ne "${password}\\n${spassword}" | vncpasswd -f > "#{vnc_passwd}"
                )
              }
              change_passwd

              # Start up vnc server (if at first you don't succeed, try, try again)
              echo "Starting VNC server..."
              for i in $(seq 1 10); do
                # Clean up any old VNC sessions that weren't cleaned before
                #{vnc_clean}

                # for turbovnc 3.0 compatibility.
                if timeout 2 vncserver --help 2>&1 | grep 'nohttpd' >/dev/null 2>&1; then
                  HTTPD_OPT='-nohttpd'
                fi

                # Attempt to start VNC server
                VNC_OUT=$(vncserver -log "#{vnc_log}" -rfbauth "#{vnc_passwd}" -noxstartup 2>&1)
                VNC_PID=$(pgrep -s 0 Xvnc) # the script above will daemonize the Xvnc process
                echo "${VNC_OUT}"

                # Sometimes Xvnc hangs if it fails to find working display, we
                # should kill it and try again
                kill -0 ${VNC_PID} 2>/dev/null && [[ "${VNC_OUT}" =~ "Fatal server error" ]] && kill -TERM ${VNC_PID}

                # Check that Xvnc process is running, if not assume it died and
                # wait some random period of time before restarting
                kill -0 ${VNC_PID} 2>/dev/null || sleep 0.$(random_number 1 9)s

                # If running, then all is well and break out of loop
                kill -0 ${VNC_PID} 2>/dev/null && break
              done

              # If we fail to start it after so many tries, then just give up
              kill -0 ${VNC_PID} 2>/dev/null || clean_up 1

              # Parse output for ports used
              display=$(echo "${VNC_OUT}" | awk -F':' '/^Desktop/{print $NF}')
              port=$((5900+display))

              echo "Successfully started VNC server on ${host}:${port}..."

              #{super}
            EOT
          end

          # Run the script under the VNC server's display
          def run_script
            %(DISPLAY=:${display} #{super})
          end

          # After starting up the main script, scan the VNC server log file for
          # successful connections so that the password can be reset
          def after_script
            websockify_cmd = context.fetch(:websockify_cmd, "${WEBSOCKIFY_CMD:-/opt/websockify/run}").to_s

            <<-EOT.gsub(/^ {14}/, "")
              #{super}

              # Launch websockify websocket server
              echo "Starting websocket server..."
              websocket=$(find_port)
              #{websockify_cmd}websockify -D ${websocket} localhost:${port}

              # Set up background process that scans the log file for successful
              # connections by users, and change the password after every
              # connection
              echo "Scanning VNC log file for user authentications..."
              while read -r line; do
                if [[ ${line} =~ "Full-control authentication enabled for" ]]; then
                  change_passwd
                  create_yml
                fi
              done < <(tail -f --pid=${SCRIPT_PID} "#{vnc_log}") &
            EOT
          end

          # Clean up the running VNC server and any other stale VNC servers
          def clean_script
            <<-EOT.gsub(/^ {14}/, "")
              #{super}

              #{vnc_clean}
              [[ -n ${display} ]] && vncserver -kill :${display}
            EOT
          end

          # Log file for VNC server
          def vnc_log
            context.fetch(:vnc_log, "vnc.log").to_s
          end

          # Password file for VNC server
          def vnc_passwd
            context.fetch(:vnc_passwd, "vnc.passwd").to_s
          end

          # Arguments sent to `vncserver` command
          def vnc_args
            context.fetch(:vnc_args) do
              name       = context.fetch(:name, "").to_s
              geometry   = context.fetch(:geometry, "").to_s
              dpi        = context.fetch(:dpi, "").to_s
              fonts      = context.fetch(:fonts, "").to_s
              idle       = context.fetch(:idle, "").to_s
              extra_args = context.fetch(:extra_args, "").to_s

              args = []
              args << "-name #{name}" unless name.empty?
              args << "-geometry #{geometry}" unless geometry.empty?
              args << "-dpi #{dpi}" unless dpi.empty?
              args << "-fp #{fonts}" unless fonts.empty?
              args << "-idletimeout #{idle}" unless idle.empty?
              args << extra_args

              args.join(" ")
            end.to_s
          end

          # Clean up any stale VNC sessions
          def vnc_clean
            context.fetch(:vnc_clean) do
              %(vncserver -list | awk '/^:/{system("kill -0 "$2" 2>/dev/null || vncserver -kill "$1)}')
            end.to_s
          end
      end
    end
  end
end
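
Before patching the template, it may be worth checking what the websockify path actually resolves to on a compute node. The sketch below reuses the same default expansion that appears in after_script above; websockify_path is a hypothetical helper name, and the guess that a directory (rather than a launcher script) was configured is only an assumption.

```shell
#!/bin/sh
# Same fallback the template uses: WEBSOCKIFY_CMD if set and non-empty,
# otherwise /opt/websockify/run.
websockify_path() {
  echo "${WEBSOCKIFY_CMD:-/opt/websockify/run}"
}

cmd="$(websockify_path)"
if [ -f "$cmd" ] && [ -x "$cmd" ]; then
  echo "ok: $cmd is an executable file"
elif [ -d "$cmd" ]; then
  echo "warning: $cmd is a directory, not the websockify launcher"
else
  echo "warning: $cmd does not exist or is not executable"
fi
```

If the check reports a directory, pointing WEBSOCKIFY_CMD at the launcher inside it from the cluster config might be an alternative to editing vnc.rb, though that has not been verified on this setup.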

Could you please tell me how you would like us to proceed with our findings: should we open an issue or a pull request to fix this? Thank you.

A pull request is best. I can patch it myself since you gave the file there, but if you want credit, you should open a pull request.