Remote Desktop: New connection has been rejected with reason: Authentication failed

Hello all,
How can I troubleshoot the following error, which happens when several sessions share the same node?


This is with OOD 2.0.29 on CentOS 7.9.
It happens in both Edge and Chrome.
I looked at the session logs after we killed the job, but found nothing useful.
Is there a way to increase the log verbosity, or are there logs other than the session one that I should check?

Sometimes the error message is “too many retries” instead, as described in this ticket: Unable to connect to a Linux remote session : Authentication failed. Too many tries · Issue #865 · Azure/az-hop (github.com)

Thank you,
Xavier

In your output.log, do you see the message Setting VNC password... a lot?
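A quick way to answer that is to count the matching lines. This is only a minimal sketch: the function name and the relative path `output.log` are assumptions for illustration, so point it at the output.log in your own session directory.

```python
from pathlib import Path

# The message text as it appears in the session's output.log.
MARKER = "Setting VNC password..."


def count_password_resets(log_text: str) -> int:
    """Count the lines of a session log that contain the password-reset message."""
    return sum(1 for line in log_text.splitlines() if MARKER in line)


if __name__ == "__main__":
    # Hypothetical location: replace with the output.log of the session being debugged.
    log_path = Path("output.log")
    if log_path.exists():
        print(count_password_resets(log_path.read_text(errors="replace")))
```

A count of 1 would suggest the password was set only once for that session; a larger count would point to repeated resets.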

We reset the password every time someone logs in, so any password is one-time use only.

I suspect the passwords are being rotated so quickly that the password shown in the view is no longer the correct one.

Do you have some other mechanism that logs in to this server and causes us to refresh the password before you (the user) can use it?

Thanks Jeff. In output.log I see only a single Setting VNC password... message,
and the password in the URL matches the one in connection.yml.
The only suspicious messages are these:

    WebSocket server settings:
      - Listen on :61006
      - Flash security policy server
      - No SSL/TLS support (no cert file)
      - Backgrounding (daemon)
    WARNING: no 'numpy' module, HyBi protocol will be slower
    Traceback (most recent call last):
      File "/usr/bin/websockify", line 11, in <module>
        load_entry_point('websockify==0.8.0', 'console_scripts', 'websockify')()
      File "/usr/lib/python2.7/site-packages/websockify/websocketproxy.py", line 525, in websockify_init
        server.start_server()
      File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 973, in start_server
        tcp_keepintvl=self.tcp_keepintvl)
      File "/usr/lib/python2.7/site-packages/websockify/websocket.py", line 741, in socket
        sock.bind(addrs[0][4])
      File "/usr/lib64/python2.7/socket.py", line 224, in meth
        return getattr(self._sock,name)(*args)
    socket.error: [Errno 98] Address already in use
    Scanning VNC log file for user authentications...

I’m trying to understand where job_script_content.sh is built so that I can add more logging and tests to it. I’ve checked in /var/www/ood/apps/sys/bc_desktop, but not all of the scripts are there.

I now understand this issue. It occurs on a VM that is shared across multiple connections, and for which we restrict the port range to only 10 ports:

    vnc:
      min_port: 61001
      max_port: 61010
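To see how much of such a small range is still available on the node, one option is to try binding each port in turn. This is a rough sketch, not how OOD itself picks a port; the function name `free_ports` is made up for illustration, and a failed bind is simply treated as "not available".

```python
import socket


def free_ports(min_port: int, max_port: int, host: str = "") -> list:
    """Return the ports in [min_port, max_port] that can currently be bound."""
    available = []
    for port in range(min_port, max_port + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Already in use (or binding not permitted): count it as unavailable.
                continue
            available.append(port)
    return available


if __name__ == "__main__":
    # The range from the configuration above.
    print(free_ports(61001, 61010))
```

An empty list here would mean every port in the range is already taken, which is the situation in which a new session can no longer start its proxy.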

When a session ends, whether because the walltime expired, the session was deleted, or the job was killed, some of its processes are left running, like the ones below

These accumulate over time and eventually exhaust the available ports, hence the error message:

socket.error: [Errno 98] Address already in use
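For reference, the same error can be reproduced in isolation by binding two sockets to one port. This is just an illustrative sketch (the helper name `bind_twice` is made up); on Linux the errno returned for the second bind is EADDRINUSE, which is 98.

```python
import errno
import socket


def bind_twice():
    """Bind two sockets to the same port and return the errno of the second attempt."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))  # same port again
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = bind_twice()
    print(code, errno.errorcode.get(code))
```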

So how can I cleanly terminate all the processes started by the remote session once the job has finished? This is with OpenPBS. The most important one seems to be websockify.

One more thing: I don’t see any after.sh or clean.sh in the session output directory. Is that where this cleanup code should be added? If so, how?
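Before deciding where cleanup belongs, it can help to list what is actually left behind on the node. The sketch below only lists processes, it does not kill anything. It assumes a Linux /proc filesystem, and the function name `find_processes` and the `proc_root` parameter are made up for illustration.

```python
import os
from pathlib import Path


def find_processes(pattern: str, uid: int, proc_root: str = "/proc") -> list:
    """Return (pid, command line) for processes owned by uid whose command line contains pattern."""
    matches = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            if entry.stat().st_uid != uid:
                continue
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited in the meantime, or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if pattern in cmdline:
            matches.append((int(entry.name), cmdline))
    return sorted(matches)


if __name__ == "__main__":
    for pid, cmd in find_processes("websockify", os.getuid()):
        print(pid, cmd)
```

Running this after a session has ended shows whether a websockify process from that session is still holding a port.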

Thank you

I’m not familiar with the equivalent PBS setting, but in Slurm I would point you to ProctrackType, which needs to be set to proctrack/cgroup.

PBS may be tracking the processes it needs to clean up by parent process ID. If so, some of these may not get captured, because their parent process is 1 (systemd). These processes should, however, still be part of the job’s cgroup.

So cleaning processes by cgroup is better than by PPID.
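To illustrate the idea: cgroup membership is recorded in /proc/<pid>/cgroup regardless of who the parent is, so processes can be selected by cgroup path. This is a rough sketch only. The function names are made up, and the fragment to search for depends on how your scheduler names its job cgroups, so the value shown in the usage line is purely hypothetical.

```python
from pathlib import Path


def parse_cgroup(text: str) -> list:
    """Parse the content of /proc/<pid>/cgroup into a list of cgroup paths."""
    paths = []
    for line in text.splitlines():
        # Each line has the form hierarchy-ID:controller-list:cgroup-path
        parts = line.split(":", 2)
        if len(parts) == 3:
            paths.append(parts[2])
    return paths


def pids_in_cgroup(fragment: str, proc_root: str = "/proc") -> list:
    """Return the PIDs that have at least one cgroup path containing fragment."""
    pids = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            text = (entry / "cgroup").read_text()
        except OSError:
            continue  # process exited in the meantime, or is not readable
        if any(fragment in path for path in parse_cgroup(text)):
            pids.append(int(entry.name))
    return sorted(pids)


if __name__ == "__main__":
    # Hypothetical fragment: use whatever identifies the job in your cgroup paths.
    print(pids_in_cgroup("1234.server"))
```

A daemonized process whose parent has become PID 1 would still show up here, whereas walking the tree of child processes from the job's shell would miss it.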

https://slurm.schedmd.com/slurm.conf.html#OPT_ProctrackType

I can confirm that I was able to solve this issue by:

  • implementing the pbs_cgroups hook (after backporting a fix made in 20.x to 19.x)
  • installing nc, which was missing; the fallback mechanism for finding a free port didn’t work when the same machine was shared by several users.
